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Abstract 


Probabilistic graphical models (PGMs) provide a compact and flexible framework to model 
very complex real-life phenomena. They combine the probability theory which deals 
with uncertainty and logical structure represented by a graph which allows to cope with 
the computational complexity and also interprete and communicate the obtained know- 
ledge. In the thesis we consider two different types of PGMs: Bayesian networks (BNs) 
which are static, and continuous time Bayesian networks which, as the name suggests, 
have temporal component. We are interested in recovering their true structure, which 
is the first step in learning any PGM. This is a challenging task, which is interesting 
in itself from the causal point of view, for the purposes of interpretation of the model 
and the decision making process. All approaches for structure learning in the thesis 
are united by the same idea of maximum likelihood estimation with LASSO penalty. 
The problem of structure learning is reduced to the problem of finding non-zero coefficients 
in the LASSO estimator for a generalized linear model. In case of CTBNs we consider 
the problem both for complete and incomplete data. We support the theoretical results 


with experiments. 
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Chapter 1 


Introduction 


1.1 Motivation 


It is a common knowledge that we live in the world where data plays crucial role in many 
areas and applications of great importance for our society and the importance of data is 
still growing. The amount of data in the world is now estimated in dozens of zettabytes, 
and by 2025 the amount of data generated daily is expected to reach hundreds of exa- 
bytes. There is a demand for models and algorithms that can deal with these amounts 
of data effectively finding useful patterns and providing better insights into the data. On 
top of it, most environments require reasoning under uncertainty. Probabilistic graphical 
models (PGMs) provide such a framework that allows to deal with these and many other 
challenges in various situations. The models combine the probability theory which deals 
with uncertainty in a mathematically consistent way, and logical structure which is repre- 
sented by a graph encoding certain independence relationships among variables allowing 
to cope with the computational complexity. 

PGMs encode joint distributions over a set of random variables (often of a significant 
amount) combining the graph theory and probabilities, which allows to represent many 
complex real-world phenomena compactly and overcome the complexity of the model 
which is exponential in the number of variables. There are also some other advantages 
that these models have. Namely, because of their clear structure, PGMs enable us to 
visualize, interprete and also communicate the gained knowledge to others as well as 
make decisions. Some models, for example Bayesian networks, have directed graphs in 
their core and offer ways to establish causality in various cases. Moreover, graphical 
models allow us not only to fit the observed data but also elegantly incorporate prior 
knowledge, e.g. from experts in the domain, into the model. Besides, certain models take 
into account a temporal component and consider systems’ dynamics in time. 

Graphical models are successfully applied to a large number of domains such as image 
processing and object recognition, medical diagnosis, manufacturing, finance, statistical 
physics, speech recognition, natural language processing and many others. Let us briefly 
present here a few examples of various applications. 


Bayesian networks, one of the PGMs considered in this thesis, are extensively used in 


the development of medical decision support systems helping doctors to diagnose patients 
more accurately. In the work by Wasyluk et al. (2001) the authors built and described a 
probabilistic causal model for diagnosis of liver disorders. In the domain of hepatology, 
inexperienced clinicians have been found to make a correct diagnosis in jaundiced patients 
in less than 45% of the cases. Moreover, the number of cases of liver disorders is on 
the rise and, especially at early stages of a disease, the correct diagnosis is difficult yet 
critical, because in many cases damage to the liver caused by an untreated disorder 
may be irreversible. As we already mentioned and as it is stressed out in the work 
above, a huge advantage that these models have is that they allow to combine existing 
frequency data with expert judgement within the framework as well as update themselves 
when the new data are obtained, for example patients data within a hospital or a clinic. 
What is also important in the medical diagnosis is that PGMs, Bayesian networks in 
particular, efficiently model simultaneous presence of multiple disorders, which happens 
quite often, but in many classification approaches the disorders are considered to be 
mutually exclusive. The overall model accuracy, as the authors Wasyluk et al. (2001) 
claim, seems to be better than that of beginning diagnosticians and reaches almost 80%, 
which can be used for the diagnosis itself as well as the way to help new doctors to 
learn the strategy and optimization of the diagnosis process. A few other examples of 
the PGMs application in medical field are management of childhood malaria in Malawi 
(Bathla Taneja et al. (2021)), estimating risk of coronary artery disease (Gupta et al. 
(2019)), ete. 

The next popular area of graphical models application is computational biology, for 
example Gene Regulatory Network (GRN) inference. GRN consists of genes or parts of 
genes, regulatory proteins and interactions between them and plays a key role in medi- 
ating cellular functions and signalling pathways in cells. Accurate inference of GRN for 
a specific disease returns disease-associated regulatory proteins and genes, serving as po- 
tential targets for drug treatment. Chen and Xuan (2020) argued that Bayesian inference 
is particularly suitable for GRNs as it is very flexible for large-scale data integration, 
because the main challenge of GRNs is that there exist hundreds of proteins and tens 
of thousands of genes with one protein possibly regulating hundreds of genes and their 
regulatory relationship may vary across different cell types, tissues, or diseases. More- 
over, the estimation is more robust and easier to compare on multiple datasets. Chen 
and Xuan (2020) demonstrated this by applying their model to breast cancer data and 
identified genes relevant to breast cancer recurrence. As another example in this area, 
Sachs et al. (2005) used Bayesian network computational methods for derivation of causal 
influences in cellular signalling networks. These methods automatically elucidated most 
of the traditionally reported signalling relationships and predicted novel interpathway 
network causalities, which were verified experimentally. Reconstruction of such networks 
might be applied to understanding native-state tissue signalling biology, complex drug 
actions, and dysfunctional signalling in diseased cells. 

The use of probability models is extensive also in computer vision applications. In their 


work Frey and Jojic (2005) advocate for the use of PGMs in the computer vision problems 


requiring decomposing the data into interacting components, for example, methods for 
automatic scene analysis. They apply different techniques in a vision model of multiple, 
occluding objects and compare their performances. Occlusion is a very important effect 
and one of the biggest challenges in computer vision that needs to be taken into account, 
and PGMs are considered to be a good tool to handle that effect. PGMs are also used 
for tracking different moving objects in video sequences, for example long-term tracking 
of groups of pedestrians on the street (Jorge et al. (2007)), where the main difficulties 
concern total occlusions of the objects to be tracked, as well as group merging and split- 
ting. Another example is on-line object tracking (Jorge et al. (2004)) useful in real time 
applications such as video surveillance, where authors overcame the problem of needing 
to analyze the whole sequence before labelling trajectories to be able to use the tracker 


on-line and also the problem of unboundedly growing complexity of the network. 


1.2 Probabilistic Graphical Models 


In the previous subsection we described the advantages of PGMs and why one might be 
interested in studying them. In this work we focus on two types of PGMs: Bayesian 
Networks (BN) and Continuous Time Bayesian Networks (CTBN). The first term has 
rather long history and tracks back to 1980s (Pearl (1985)) whereas the second term is 
relatively modern (Nodelman et al. (2002)). The underlying structure for both models 
is a directed graph, which can be treated either as a representation of a certain set of 
independencies or as a skeleton for factorizing a distribution. In some cases the directions 
of arrows in the graph can suggest causality under certain conditions and allow not only 
the inference from the data but also intervene into the model and manipulate desired 
parameters in the future. BNs are static models, i.e. they do not consider a temporal 
component, while in CTBNs as the name suggests we study models in the context of 
continuous time. The framework of CTBNs is based on homogeneous Markov processes, 
but utilizes ideas from Bayesian networks to provide a graphical representation language 
for these systems. 

A broad and comprehensive tutorial on existing research for learning Bayesian net- 
works and some adjacent models can be found in Daly et al. (2011). The subject of 
causality is extensively explored in Spirtes et al. (2000) and Pearl (2000), some references 
are also given in Daly et al. (2011). Several examples of the use of BNs were presented 
above. 

In contrast to regular Bayesian networks, CTBNs have not been studied that well yet. 
The most extensive work concerning CTBNs is PhD thesis of Nodelman (2007). Some 
related works include for example learning CTBNs in non-stationary domains (Villa and 
Stella (2018)), in relational domains (Yang et al. (2016)) and continuous time Bayesian 
network classifiers (Stella and Amer (2012)). As an example, CTBNs have been suc- 
cessfully used to model the presence of people at their computers together with their 
availability (Nodelman and Horvitz (2004)), for dynamical systems reliability modeling 


and analysis (Boudali and Dugan (2006)), for network intrusion detection (Xu and Shel- 
ton (2008)), to model social networks (Fan and Shelton (2012)), to model cardiogenic 
heart failure (Gatti et al. (2012)), and for gene network inference (Stella et al. (2014) or 
Stella et al. (2016)). 


1.3 Overview of the thesis and its contributions 


There are several problems within both the BN and CTBN frameworks. Both of them 
have graph structures which need to be discovered and this is considered to be one of the 
main challenges in the field. This thesis is dedicated exclusively to solving this problem 
in both frameworks. Another problem is to learn the parameters of the model: in the 
case of BNs it is a set of conditional probability distributions and in the case of CTBNs 
it is a set of conditional intensity matrices (for details see Chapter 2). The last problem 
is the statistical inference based on the obtained network (details are in Chapter 3). 

The thesis is constructed as follows. In Chapter 2 we provide all the necessary pre- 
liminaries for better understanding the frameworks of Bayesian networks and continuous 
time Bayesian networks. Next, in Chapter 3 we overview known results on learning net- 
works’ parameters as well as inference to fully cover the concept of interest. Chapter 4 is 
dedicated to the structure learning problem for BNs, where we provide novel algorithms 
for both discrete and continuous data. Chapters 5 and Chapter 6 cover the problems 
of structure learning for CTBNs in cases of complete and incomplete data, respectively. 
Finally, Chapter 7 concludes the thesis with the summary and the discussion of obtained 
results. 

Algorithms in both Chapters 4 and 5 lean on feature selection in generalized linear 
models with the use of LASSO (Least Absolute Shrinkage and Selection Operator) penalty 
function. It relies on the idea of penalizing the parameters of the model, i.e. adding or 
subtracting the sum of absolute vales of the parameters of the model with some hyper- 
parameter, in order to better fit the model and perform a variable selection by forcing 
some parameters to be equal to 0. The term first appeared in Tibshirani (1996). More 
on the topic of LASSO can be found for example in Hastie et al. (2015). In Section 2.6 
we provide a short description of the concept. 

The main contributions of the thesis are collected in Chapters 4, 5 and 6 and they are 


as follows: 


e we provide the novel algorithm for learning the structure of BNs based on penalized 


maximum likelihood function both for discrete and continuous data; 


e we present and prove the consistency results for the algorithm in case of continuous 
data; 


e we compare the effectiveness of our method with other most popular methods for 
structure learning applied to benchmark networks of continuous data of different 


sizes; 


e we provide the novel algorithm for learning the structure of CTBNs based on pena- 
lized maximum likelihood function for complete data and present two theoretical 


consistency results with proofs; 


e we provide the novel algorithm for learning the structure of CTBNs based on pe- 
nalized maximum likelihood function for incomplete data where the log-likelihood 
function is replaced by its Markov Chain Monte Carlo (MCMC) approximation due 
to inability to express it explicitly; 


e we present and prove the convergence of the proposed MCMC scheme and the 


consistency of the learning algorithm; 


e for the mentioned above MCMC approximation we designed the algorithm to pro- 
duce necessary samples; 


e in both cases of complete and incomplete data we provide results of the simulations 
to show the effectiveness of proposed algorithms. 


Part of the content (Chapter 5) in its early stages has been published on arXiv: 

Shpak, M., Miasojedow, B., and Rejchel, W., Structure learning for CTBNs via pe- 
nalized maximum likelihood methods, arXiv e-prints, 2020, https: //doi.org/10.48550/ 
arXiv. 2006 .07648. 


Chapter 2 
Preliminaries 


In this chapter we provide theoretical background on Bayesian networks (BNs), Markov 
processes, conditional Markov processes and continuous time Bayesian networks (CTBNs). 
We start with the notation common for BNs and CTBNs which we will use through 
the whole thesis. Then we provide a few basic definitions needed to define and under- 
stand the concepts of BNs and CTBNs with their interpretation and examples. Most of 
the contents of this chapter comes from the Nodelman et al. (2002), Nodelman (2007), 
Koller and Friedman (2009). 


2.1 Notation 


First, by upper case letters, for example, X;, B, Y, we denote random variables. In 
the case of CTBNs upper case letters represent the whole collection of random variables 
indexed by continuous time, hence in this case X;(t), Y(t) are random variables for par- 
ticular time points t. 

Values of variables are denoted by lower case letters, sometimes indexed by numbers 
or otherwise representing different values of the same random variable - e.g. xi, s, s. 
The set of possible values for a variable X is denoted by Val(X) and by |X| we will 
denote the number of its elements. 

Sets of variables are denoted by bold-face upper case letters - e.g. X - and correspon- 
ding sets of values are denoted by bold-face lower case letters - e.g. a or x. The set of 
possible values and its size is denoted by Val(X) and |X]. 

A pair G = (V,€) denotes a directed graph, where V is the set of nodes and € is 
the set of edges. The notation u > w means that there exists an edge from the node u to 
the node w. We will also call them arrows. The set V \ {w} is denoted by —w. Moreover, 
we define the set of the parents of the node w in the graph G by 


pag(w) = {ue V : uw. 


When there is no confusion, for convenience we sometimes write pa(w) instead of pag(w). 


Other useful and relevant locally notation we provide in the corresponding sections. 


2.2 Bayesian networks 


In this section we provide an overview of Bayesian networks (BNs). We start with the in- 
tuition behind BNs followed by the representation of BNs together with its formal defini- 
tion and notation. The problems of inference and learning for BNs are considered more 
thoroughly in Section 3.2 and Chapter 4 respectively. 

The goal is to represent a joint distribution p over some set of random variables 
X = {X,...,X,}. Even in the simplest case where these variables are binary-valued, 
the joint distribution requires the specification of 2” — 1 numbers - the probabilities of 
the 2” different assignments of the values {z1,..., £n}. The explicit representation of 
the joint distribution is hard to handle from every perspective except for small values 
of n. Computationally, it is very expensive to manipulate and generally too large to store 
in computer memory. Cognitively, it is impossible to acquire so many numbers from a 
human expert; moreover, most of the numbers would be very small and would correspond 
to events that people cannot reasonably consider. Statistically, if we want to learn the 
distribution from data, we would need ridiculously large amounts of data to estimate so 
many parameters robustly (Koller and Friedman (2009)). 

Bayesian networks help us specify a high-dimensional joint distribution compactly by 
exploiting its independence properties. The key notion behind the BN representation is 
conditional independence, which on the one hand allows to reduce amount of estimated 
parameters significantly and on the other hand, allows to avoid very strong and naive 
independence assumptions. 


Definition 2.1. Two random variables X and Y are independent (denoted by X LY) 
if and only if the equality 


P(X € A, Y € B)=P(X € A)P(Y € B) 


holds for all Borel sets A,B CR. 


For short, we will write it in the form P(X,Y) = P(X)P(Y). There is also another 
way to think of independence. If the random variables X and Y are independent, then 
P(X €-|Y)=P(X €.-). Intuitively, this says that having evidence about Y does not 
change the distribution of our beliefs on the occurrence of X. 

If we wish to model a more complex domain represented by some set of variables, 
it is unlikely that any of the variables will be independent of each other. Conditional 
independence is a weaker notion of independence, but it is more common in real-life 


situations. 


Definition 2.2. Two random variables X and Y are conditionally independent given 
a set of random variables C (symbolically X L Y | C) if and only if 


P(XE€A,Y€B|C)=P(X€A| OPY €BIC) (2.1) 


holds for all Borel sets A,B CR. 


Obviously (2.1) implies 
P(X €A|C,Y)=P(X €A| OC), 
which can be written shortly as 
P(X |C,Y)=P(X |©). 


So intuitively, the influence that X and Y have on each other is mediated through 
the variables in the set C. It means that, when we have some evidence about variables 
from C, having any additional information about Y does not change our beliefs about X. 
Let us demonstrate this definition on a simplified example. Let X be a random variable 
representing the case if a person has lung cancer and Y representing the case if the same 
person has yellow teeth. These variables are not independent as having yellow teeth is 
one of the secondary symptoms of lung cancer. However, when we know that the person 
is a smoker knowing that they have yellow teeth does not give us any additional insight on 
lung cancer, and vice versa, as we consider smoking to be the reason of both symptoms. 

It is easier to believe that in a given domain most variables will not directly affect most 
other variables. Instead, for each variable only a limited set of other variables influence it. 
This is the intuition which leads to the notion of a Bayesian network B over a set of random 
variables B which is a compact representation of a specific joint probability distribution. 


The formal definition is as follows. 
Definition 2.3. A Bayesian network B over a set of random variables B is formed by 


e a directed acyclic graph (DAG) G whose nodes correspond to the random variables 
B; € B,i=1,...,n. 


e the set of conditional probability distributions (CPDs) for each B;, specifying the 
conditional distribution P(B; | pag(B;)) of Bi as a function of its parent set in G. 


The CPDs form a set of local probability models that can be combined to describe 
the full joint distribution over the variables B via the chain rule: 


n 


P(B, Bo, ..., Bn) = | [ P(B: | pag(Bi)). (2.2) 
i=1 

The graph G of a Bayesian network encodes a set of conditional independence assump- 
tions. In particular, a variable B € B is independent of its non-descendants given the set 
of its parents pag(B). See for example Figure 2.1 of an Extended Student network taken 
from Koller and Friedman (2009). As it can be seen, each variable is connected only to 
a small amount of other variables in the network. In this example according to (2.2) the 

joint distribution takes the following form: 


P(C,D,1,G,S,L, J, H) = 
= P(C)P(D | C)P(NP(G | D, DECS | DEC | OPRU | L, S)P(H | G, J). 
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Coherence 


Figure 2.1: The Extended Student network 


This example will be considered in more detail further in the thesis. 

Now we discuss basic structures for BNs including some examples and give the inter- 
pretation of the structures. BNs represent probability distributions that can be formed 
via products of smaller, local conditional probability distributions (one for each variable). 
If the joint distribution is expressed in this form, it means that the independence assump- 
tions for certain variables are introduced into our model. To understand what types of 
independencies are described by directed graphs for simplicity let us start from looking at 
BN B with three nodes: X, Y, and Z. In this case, B essentially has only three possible 
structures, each of which leads to different independence assumptions. 


e Common parent, also called common cause. If G is of the form X + Y > Z, and Y 
is observed, then X L Z | Y. However, if Y is unobserved, then X £ Z. Intuitively 
this stems from the fact that Y contains all the information that determines the 
outcomes of X and Z; once it is observed, there is nothing else that affects these 
variables’ outcomes. The case with smoking and lung cancer described above is such 


an example of common cause. See the illustration (c) in Figure 2.2. 


e Cascade, or indirect connection. If G is of the form X — Y — Z, and Y is 
observed, then, again X L Z | Y. However, if Y is unobserved, then X / Z. Here, 
the intuition is again that Y holds all the information that determines the outcome 
of Z; thus, it does not matter what value X takes. In Figure 2.2 in (a) and (b) 


there are shown cases of indirect causal and indirect evidential effects, respectively. 


e V-structure or common effect, also known as explaining away. If G is of the form 
X + Y + Z, then knowing Y couples X and Z. In other words, X L Z if Y is 
unobserved, but X £ Z | Y if Y is observed. See the case (d) in Figure 2.2. 


The last case requires additional explanation. Suppose that Y is a Boolean variable 


(a) (b) (c) (d) 


Figure 2.2: The four possible two-edge trails from X to Y via Z: (a) An indirect causal 
effect; (b) An indirect evidential effect; (c) A common cause; (d) A common effect. 


that indicates whether our lawn is wet one morning; X and Z are two explanations for it 
being wet: either it rained (indicated by X), or the sprinkler turned on (indicated by Z). 
If we know that the grass is wet (Y is true) and the sprinkler did not go on (Z is false), 
then the probability that X is true must be one, because that is the only other possible 
explanation. Hence, X and Z are not independent given Y. 

To generalize this for a case of more variables and demonstrate the power but also 
the limitations of Bayesian networks we will need the notions of d-separation and J-maps. 
Let Q, W, and O be three sets of nodes in a Bayesian network B represented by G, 
where the variables O are observed. Let us use the notation (p) to denote the set of all 
independencies of the form (Q L W | O) that hold in a joint distribution p. To extend 
structures mentioned above to more general networks we can apply them recursively over 
any larger graph, which leads to the notion of d-separation. 

Recall that we say that there exists an undirected path in G between the nodes u 
and w if there exists the sequence v1,...,Un E€ V such that vi > Vi+ı or vi + vi+ı for each 
i =0,1,...,n, where vp = u and Uni; = w. Moreover, an undirected path in G between 
Q € Q and W € W is called active given observed variables O if for every consecutive 
triple of variables X,Y, Z on the path, one of the following holds: 


e common cause: X + Y - Z and Y ¢ O (Y is unobserved); 
e causal trail: X + Y — Z and Y ¢ O (Y is unobserved); 
e evidential trail: X + Y + Z and Y ¢ O (Y is unobserved); 


e common effect: X + Y + Z and Y or any of its descendants are observed. 
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X, X4 


Xə X2 
\ X X6 
Xi Pas 6 Xı 
P 
X3 Xs X3 Xs 


Figure 2.3: An example for d-separation: Xı and Xe are d-separated given X2,X3 (left), 
Xə, X3 are not d-separated given X1, X¢ (right). 


Finally, we say that Q and W are d-separated given O if there are no active paths bet- 
ween any node A € Q and B € W given O. See examples for d-separation in Figure 2.3. 
In the second example there is no d-separation because there is an active path which 
passes through the V-structure created when Xe is observed. The notion of d-separation 
lets us describe a large fraction of the dependencies that hold in our model. It can be 
shown that if Q and W are d-separated given O, then Q L W |O. 

We will write 1(G) = {(Q L W | O) : Q, W are d-separated given O} to denote the 
set of independencies corresponding to all d-separations in G. If p factorizes over G, then 
I(G) C I(p) and p can be constructed easily. In this case, we say that G is an J-map 
for p. In other words, all the independencies encoded in G are sound: variables that are 
d-separated in G are conditionally independent with respect to p. However, the converse 
is not true: a distribution may factorize over G, yet have independencies that are not 
captured in G. 

So an interesting question here is whether for the probability distribution p we can 
always find a perfect map I(G) for which (G) = I(p) or not. The answer is no (see 
an example from Koller and Friedman (2009)). Another related question is whether 
perfect maps are unique when they exist. This is not the case either, for example, DAGs 
X — Y and X + Y encode the same independencies, yet form different graphs. In a ge- 
neral case we say that two Bayesian networks B,, By are -equivalent if their DAGs encode 
the same dependencies [(G,) = I(G2). For a case of three variables we can notice that 
graphs (a), (b) and (c) in Figure 2.2 encode the same dependencies, so as long as we 
do not turn graphs into V-structures ((d) is the only structure which encodes the de- 
pendency X / Y | Z) we can change directions in them and get J-equivalent graphs. 
This brings us to a fact that if G,,G2 have the same skeleton (meaning that if we drop 
the directionality of the arrows, we obtain the same undirected graph) and the same 
V-structures, then [(G,) = I(G2). For the full proof of this statement, other previously 
made statements and more information about BNs see Koller and Friedman (2009). 
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2.3 Continuous Time Markov Processes 


In this section we collect auxiliary results on Markov processes with continuous time. We 
can think of a continuous time random process X as a collection of random variables 
indexed by time t € [0, 00). It is sometimes more convenient to view X across all values 


of t as asingle variable, whose values are functions of time, also called paths or trajectories. 


Definition 2.4. The Markov condition is the assumption that the future of a process is in- 
dependent of its past given its present. More explicitly, the process X satisfies the Markov 
property iff P(X (t+ At) | X(s),0 < s < t) =P(X(t+At) | X(t)) for allt, At > 0 (Chung 
and Walsh (2005)). 


In this thesis we focus on Markov processes with finite state space which are basically 
defined by initial distribution and a matrix of transition intensities. The framework of 
CTBNs is based on the notion of homogeneous Markov processes in which the transition 


intensities do not depend on time. 


Definition 2.5. Let X be a stochastic process with continuous time. Let the state space 
of X be Val(X) = {21,%2,...,un}. Then X is a homogeneous Markov process if and 
only if its behavior can be specified in terms of an initial distribution PX over Val(X) and 


a Markovian transition model usually presented as an intensity matrix 


=q 12 --- GIN 
qı =Q .-- QN 

Cote = a (2.3) 
qNı N2 «+--+ —4dN 


where qi = ii qi; and all the entries q; and qij are positive. 


Intuitively, the intensity q; gives the “instantaneous probability” of leaving state x; and 
the intensity qi; gives the “instantaneous probability” of the jump from a; to zj. More 
formally, for i Æ j 


Jim P(X (t+ At) = z; | X(t) = ai) = qyAt + O(At’), (2.4) 
and for alli =1,...,N 
Jim P(X(t + At) = 2; | X(t) = 2;) = 1—qAt+ O(A?’). (2.5) 


Therefore, the matrix Qx describes the instantaneous behavior of the process X and also 
makes the process satisfy the Markov assumption since it is defined solely in terms of its 
current state. 

The instantaneous specification of the transition model of X induces a probability 
distribution over the set of its possible trajectories. To see how the distribution is induced, 


we must first recall the notion of a matrix function. 
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Definition 2.6. The matrix exponential for a matrix Q is defined as 


exp Q = D Fr 
k=0 ` 


Now the set of Equations (2.4) and (2.5) can be written collectively in the form 


Jim P(X(t + At) | X(t)) = Jim, exp(Qx At) (I+ QxAt+ O(At?)). (2.6) 


= A 
So given the matrix Qx we can describe the transient behavior of X(t) as follows. If 
X(0) = x; then the process stays in state x; for an amount of time exponentially dis- 
tributed with parameter qi. Hence, the probability density function f and the corres- 
ponding distribution function F for the time when X(t) remains equal to x; are given by 


f(t) =qexp(-qait), t= 0, 
F(t) = 1 — exp(—qit), t20. 


The expected time of changing the state is 1/q;. Upon transitioning, X jumps to the 
state «; with probability qi;/qi for j Æ i. 

Example 2.7. Assume that we want to model the behavior of the barometric pressure B (t) 
discretized into three states (bı = falling, bo = steady, and b3 = rising). Then for instance 


we could write the intensity matrix as 


—0.21 0.2 0.01 
Qs = | 0.05 —0.1 0.05 
0.01 0.2 —0.21 


If we view units of time as hours, this means that if the pressure is falling, we expect that it 
will stop falling in a little less than 5 hours (1/0.21 hours). It will then transition to being 
steady with probability 0.2/0.21 ~ 0.95 and to falling with probability 0.01/0.21 ~ 0.0476. 

When the transition model is defined solely in terms of an intensity matrix (as above), 
we refer to it as using a pure intensity parameterization. The parameters for an N state 
process are {qi, qi; EQx, 1<i j <N, i Fj}. 

This is not the only way to parameterize a Markov process. Note that the distribution 
over transitions of X factors into two pieces: an exponential distribution over when the 
next transition will occur and a multinomial distribution over where the process jumps. 


This is called a mized intensity parameterization. 


Definition 2.8. The mized intensity parameterization for a homogeneous Markov pro- 
cess X with N states is given by two sets of parameters 


and 

Ox = {6ij, 1< j < N, iF 5}, 
where qy is a set of intensities parameterizing the exponential distributions over when 
the next transition occurs and @x is a set of probabilities parameterizing the distribution 


over where the process jumps. 
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To relate these two parametrizations we note the following theorem from Nodelman 


(2007). 


Theorem 2.9. Let X and Y be two Markov processes with the same state space and 
the same initial distribution. If X is defined by the intensity matriz Qx given by (2.3), 


and Y is the process defined by the mized intensity parameterization qy = {q),---,qdy} 

and Oy = {0i Aj}, then X and Y are stochastically equivalent, meaning they have the 

same state space and transition probabilities, if and only if q; = qi for alli =1,...,N and 
, Mij 


for all 1 < 4,4 <N, +73. 


2.4 Conditional Markov Processes 


In order to compose Markov processes in a larger network, we need to introduce the 
notion of a conditional Markov process. This is an inhomogeneous Markov process where 
the intensities vary with time, but not as a direct function of time. Rather, the intensities 
depend on the current values of a set of other variables, which also evolve as Markov 
processes. 

Let Y be a process with a state space Val(Y) = {y1, Y2, ---; Ym}. Assume that Y evolves 
as a Markov process Y(t) whose dynamics are conditioned on a set V of variables, each 
of which can also evolve over time. Then we have a conditional intensity matrix (CIM) 


which can be written as 


—q (V) q(V) ---  dim(V) 

qal V) —q@(V) ---  dm(V) 
Qyiv = ‘ f e. ; 

Qmi(V) Qm2a(V) --- —am(V) 


Equivalently, we can view CIM as a set of intensity matrices Qy, one for each instantiation 
of values v to the variables V, see Example 2.10. Since the framework of CTBNs which 
we consider in the thesis has a graph at its core, we will refer to the set of variables V as 
the set of parents of Y and denote it by pag(Y). Note that if the parent set pag(Y) is 
empty, then CIM is simply a standard intensity matrix. Just as a regular intensity matrix, 
CIM induces the distribution of the dynamics of Y given the behavior of pag(Y) = V. 
If V takes the value v on the interval [t,t +£) for some £ > 0, then as in Equation (2.6) 


im, P(Yi+at | Yn v) = iim, exp(Qy|,At) = iim, (I + QypAt + O(At?’)) . 


If we specify an initial distribution of Y, then we have defined a Markov process whose 
behavior depends on the instantiation v of values of pag(Y). 


Example 2.10. Consider a variable E(t) which models whether or not a person is eating 
(e, = not eating, e2 = eating) conditioned on a variable H(t) which models whether or 
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not the person is hungry (hı = not hungry, hg = hungry). Then we can specify exemplary 
CIM for E(t) as 


10 —10 


—0.01 0.01 
Qik, = | | 


2 2 
ane = Fe aa 


For instance, given this model, we expect that a person who is hungry and not eating 
is going to start eating in half an hour. Also, we expect a person who is not hungry and 
is eating to stop eating in 6 minutes (1/10 hour). 


2.5 Continuous time Bayesian networks 


In this section we define the notion of CTBN, which in essence is a probabilistic graphical 
model with the nodes as variables, the state evolving continuously over time, and where 
the evolution of each variable depends on the state of its parents in the graph. 

Before the formal definition we recall an example from Nodelman et al. (2002). Con- 
sider the situation in medical research where some drug has been administered to a patient 
and we wish to know how much time it takes for the drug to have an effect. The answer 
to this question will likely depend on various factors, such as how recently the patient ate. 
We want to model the temporal process for the effect of the drug and how its dynamics 
depends on other factors. In contrast to previously developed methods of approaching 
such a problem (e.g. event history analysis, Markov process models) the notion of CTBN 
introduced by Nodelman et al. (2002) allows the specification of models with a large struc- 
tured state space where some variables do not directly depend on others. For example, 
the distribution of how fast the drug takes effect might be mediated through how fast 
it reaches the bloodstream, which in turn may be affected by how recently the person 
ate. Figure 2.4 shows an exemplary graph structure for CTBN modelling the drug effect. 
There are nodes for the uptake of the drug and for the resulting concentration of the drug 
in the bloodstream. The concentration is also affected by how full patient’s stomach is. 
The drug is supposed to alleviate joint pain, which may be aggravated by falling pressure. 
The drug may also cause drowsiness. The model contains a cycle, indicating that whether 
the person is hungry depends on how full their stomach is, which depends on whether or 
not they are eating. 

Let G = (V,€) denote a directed graph with possible cycles, where V is the set of 
nodes and € is the set of edges. Further in the context of probabilistic graphical models 
we use the terms “nodes” and “random variables” interchangeably. For every w € V 
we consider a corresponding space %, of possible states at w and we assume that each 
space Xy is finite. We consider a continuous time stochastic process on a product space 
X = [Icy tw, s0 a state s € X is a configuration s = (Sw)wev, where sy E Xu. EW CY, 
then we write sw = (Sw)wew for the configuration s restricted to the nodes in W. We 
also use the notation Yw = [],,<yw Æw, so we can write sy E€ Xw. In what follows 
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Barometer Concentration 


Figure 2.4: (a) 


we use the bold symbol s to denote configurations belonging to ¥ only. All restricted 
configurations will be denoted with the standard font s. 

Now suppose we have a family of functions Qu : Vpag(w) X (Xw X Tw) — [0,00). 
For a fixed c € Vpag(w) we consider Q,(c;-,- ) as a conditional intensity matrix (CIM) 
at the node w (only off-diagonal elements of this matrix have to be specified, the dia- 
gonal ones are irrelevant). The state of CTBN at time t is a random element X(t) 
of the space ¥ of all the configurations. Let X,,(¢) denote its w-th coordinate. The 
process {(Xw(t))wey : t > 0} is assumed to be Markov and its evolution can be described 
informally as follows: transitions, or jumps, at the node w depend on the current confi- 
guration of its parents. If the state of any parent changes, then the node w switches to 
other transition probabilities. If s,, 4 s/,, where s,,s5/, E€ Xw, then 


P(Xy(t + dt) = sy | X-w(t) = sw, Xw(t) = sw) = Qu(Spag(w)s Sw, Sw) dt. 


Definition 2.11. A continuous time Bayesian network N over a set of random variables 
X = {X,,...,X,} is formed by two components. The first one is an initial distribution PZ 
specified as a Bayesian network B over X. The second component is a continuous transi- 


tion model, specified as 


e a directed (possibly cyclic) graph G whose nodes correspond to the random vari- 
ables X;; 


e a conditional intensity matrix Qx;\pag(x,), specifying the continuous dynamic of each 


variable X; given its parents’ configuration. 
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Essentially, CTBN is a Markov jump process (MJP) on the state space ¥ with tran- 


sition intensities given by 


wlSpaclw); Sw Sy), IÉ S-w = S, and Su Æ sl, for some w, 
Gega eres 7 (2.7) 


0, if s_y A S, for all w, 


for s Æ s’. Obviously, Q(s,s) is defined “by subtraction” to ensure that }> Q(s,s’) = 0. 


For convenience, we will often write Q(s) = —Q(s,s) so that Q(s) > 0. In particular, 
Qu (Cj Sw) = — D Qu(c; 8, 8’). 

It is caper at to note that we make a fundamental assumption in the construction 
of the CTBN model: two variables cannot transition at the same time (a zero in the 
definition of Q(s,s)). This can be viewed as a formalization of the view that variables 
must represent distinct aspects of the world. We should not, therefore, model a domain in 
which we have two variables that functionally and deterministically change simultaneously. 
For example, in the drug effect network, we should not add a variable describing the type 
of food, if any, a person is eating. We could, however, change the value space of the 
“Eating” variable from a binary “yes/no” to a more descriptive set of possibilities. 

Further we omit the symbol G in the indices and write pa(w) instead of pag(w). For 
CTBN the density of a sample trajectory X = X(|0,7]) on a bounded time interval [0, T] 


decomposes as follows: 


p(X) = v(X(0)) [] (Xu || Xpat) » (2.8) 


wEV 


where v is the initial distribution on ¥ and p(Xw || Xpa(w)) is the density of piecewise 
homogeneous Markov jump process with the intensity matrix equal to Qu (c; +,- ) in every 
time sub-interval such that Xpa(w) = c. Below we explicitly write an expression for 
the density p(X || Xpa(w)) in terms of moments of jumps and the skeleton of the process 
(Xu, Xpa(w)), as in (2.8), where by skeleton we understand the sequence of states of 
the process corresponding to the sequence of moments of time. 

Let T” = (t8... ,t2,...) and TP = (gpa) a Ea ...) denote moments of jumps 
at the node w € V and at parent nodes, respectively. By convention, put tj’ = ppatw) = 


— paw) 


+1 restigi = tmax. Analogously, SY and SP?) denote the corresponding 


and tire 
pa(w) ,pa(w) 

te sty ds 
j =0,1,...|7P?2™| such that Xpa(w) is constant and X, is homogeneous in each segment. 
Next we define sets J; = {i > 0: ide < tP < pan with notation Jpeg and Jena for 


skeletons. Thus we divide the time interval [0, tmax] into disjoint segments | 
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the first and the last element of J;. Then we obtain the following formula. 
P(Xw || Xpatwy) = p(T”, S™ || SP), TPAM)) — 


|rPpaw) | 
= I] {a F 0) p Ga ae EF s; ) x 


j=0 i€l; 


x I] exp (=e = ‘a OPC ria <p) x 


ic Ij {beg} 


e Ea T 8h 0) — CFI — a S) 


+I = M exp (60 — PG (5 Eo 


Below in Figure 2.5 there is an example of a trajectory of the node w with two possible 
states and of its parent with also two possible states 0 and 1. In this case the sets of 
indices are Ip = {2,3,4}, h = {0} and D = {7}. 


gpa) =f gba) 1 
1 ————— — 
: pa(w) 
: go 
0 ; ` ; 
: s =l s= 1 
1 — 
0 so = 0 i s7 = 0 
u g : , ; 
to oe ae ite te t7 tY t 
pa(w) pa(w) pa(w) pa(w) 
to t ty t3 


Figure 2.5: An exemplary trajectory of a node w and its parents pa(w). 


In consequence, using the fundamental property of the exponential function we may 
write p(Xw || Xpa(w)) in the form 


P(Xu || Xpaw)) = I] I] I] Qu(c; s,s’) Ss) exp [-Qu(c; s,s’)ti(c; s)], 
CE Xpa(w) SEXw oe 
(2.9) 


where 


e n? (c; s,s’) denotes the number of jumps from s € Xy to s’ € Xu at the node w on 
the time interval [0,7], which occur when the parent configuration is c E€ Xpa(w), 
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e t1 (c;s) is the length of time that the node w is in the state s E€ %, on the time 
interval [0,7], when the configuration of parents is c€ E€ Xpa(w)- 


To simplify the notation we omit the upper index T in n7(c; s,s’) and t? (c; s) further in 


the thesis, except for the part where we consider martingales. 


2.6 The LASSO penalty 


In this section we shortly describe the notions of the LASSO penalty and LASSO estima- 
tors which constitute the base of the novel algorithms for structure learning in the thesis. 
LASSO is the acronym for Least Absolute Shrinkage and Squares Operator. The term 
was invented by Tibshirani (1996) though the general concept was introduced even earlier. 
Most of the contents of this section come from Hastie et al. (2015). 

The underlying idea of the LASSO estimators is the assumption of sparsity. A sparse 
statistical model is one in which only a relatively small number of parameters (or predic- 
tors) play an important role. Consider a linear regression model with N observations y; of 
a target variable and x; = (@j,..., Zip)! of p associated predictor variables which are also 
called features. The goal is to predict the target from the predictors for future data and 
also to discover which predictors are relevant. In the linear regression model we assume 
that 

p 
Yi = Po + ` Oty + €i, 
j=1 
where 3 = (Bo, 61,- -, Bp) is the vector of unknown parameters and e; is an error term. 


The standard way to find 6 is to minimize the least-squares function 


N p 2 
` (vi — fo — X Bizu) . 
i=1 j=1 

Typically all of the estimates appear to be non-zero, which complicates the interpretability 
of the model especially with a high number of possible predictors. Moreover, since the data 
have noise the model will try to fit the training observations too much and the parameters 
will most probably take extreme values. In case when p > N the estimates are not even 
unique, so most of solutions will overfit the data. 

The solution is to regularize the estimation process, i.e. add some constraints on the pa- 
rameters. The LASSO estimator uses /;-penalty, which means that we minimize the least- 
square function with an additional bound on ¢;-norm of 8, namely |||], = X5- |8;l < t. 
The value t is the user-specified parameter usually called hyperparameter. The motiva- 
tion to use f;-penalty instead of any other ¢,-penalty comes from the fact that if t is small 
enough we obtain a sparse solution with only a small amount of non-zero parameters. This 
does not happen for @,-norm if q > 1, and if q < 1 the solutions are sparse but the problem 
is not convex. Convexity simplifies the computations as well as the theoretical analysis of 


the properties of the estimator. This allows for scalable algorithms capable of handling 
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problems with even millions of parameters. Before the optimization process we typically 
standardize the predictors so that each column is centred, i.e. the mean for each column 
is 0, and has unit variance, i.e. the mean of squares is equal to 1. We also centre the target 
column, so in the result we can omit the intercept term (p in the estimation process. 
The LASSO penalty is used not only in linear regression but in a wide variety of 
models, for example generalized linear models where the target and the linear model are 
connected through some link function. Hence, in a more general case we can formulate 


the optimization problem as 


Ê = argmin{L(0,D) + Allêlh], 
OER? 
where £L(0, D) is the arbitrary loss function for the data D and the parameter vector 0. 
The tuning hyperparameter À corresponds to the constraining value t, there is one-to- 
one correspondence. This is so-called Lagrangian form for the LASSO problem described 
above. 

In the setting of structure learning for Bayesian networks, both static and continuous, 
we formulate the problem as an optimization problem for a linear or generalized linear 
model, where the parameter vectors encode the dependencies between variables in the 
network. We use the LASSO penalty in all the formulated problems, hence the prob- 
lem of finding arrows in the graph reduces to recovering certain non-zero parameters in 
the LASSO estimator. As the loss functions we use the negative log-likelihood function 


and the residual sum of squares. 
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Chapter 3 


Statistical inference for networks with 


known structure 


There are three main classes of problems concerning Bayesian networks (both static and 
continuous time). The first one is to discover the structure of the network. Namely, we 
need to specify the underlying graph of the network, which nodes are the variables of 
interest, and its edges encode the dependencies between the variables. This problem will 
be covered in subsequent chapters. 

The second problem is to learn the parameters of the network. Namely, knowing 
the structure of the network we need to specify the behaviour of the network in any 
specified node given the states of its parents. In the context of static BN this behaviour 
is encoded by conditional probability distributions (CPD, see (2.2)). The corresponding 
parameters in case of CTBNs are conditional intensity matrices (CIM, see (2.7)). 

The third type of problems is to make statistical inference using the network with 
known structure and parameters. For instance, we may want to predict the state of some 
node of interest or, knowing states of some nodes, find which combination of the remaining 
nodes explains them the best. Finally, we may be interested in prediction of the future 
dynamics (in time) of some nodes of the network. 

In this chapter we discuss well known results concerning the problems of learning 
the parameters of the network and then the inference based on the fully discovered net- 
work. The contents of this chapter are mainly based on Koller and Friedman (2009), 
Nodelman (2007) and Heckerman (2021) with more detailed references throughout it. 


3.1 Learning probabilities in BNs 


First we discuss the discrete case. We assume that the Bayesian network with the 
known underlying graph G includes n nodes each corresponding to a variable X; € X for 


Lig? a! 


474) g~ 


i= 1,...,n. Also, each variable X; is discrete, having r; possible values x 
We denote an observed value of X; in l-th observation as X;[{]. If each node is observed 
m times, then we obtain the sample dataset D = {D,, Do,..., Dm} with the sample 
Dı = (Xall), X2[l],..., Xn[l]) indicating the observed values of all the nodes in the l-th 
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sampling. We refer to each D; as a case. If all cases are complete, i.e. no missing values oc- 
curred in the dataset D, it is considered as complete data; otherwise, it is called incomplete 
data. Missing values in data can occur for many different reasons, for instance, people 
filling out a survey may prefer not to answer some questions or certain measurements 
might not be available for some patients in a medical setting. 

There are mainly two categories of methods for parameter estimation in BN: one is 
for dealing with the complete data, and the other is for incomplete data. We will provide 
concise descriptions of two algorithms for the first category such as maximum likelihood 


estimation and Bayesian method; and we will briefly discuss algorithms for the second 


category. 
Assume that as in (2.2) we can write the joint distribution of the variables in X as 
follows 7 
P(X1, Xa,- Xn | 0) = | [ PX: | pag(Xi), 03 
i=1 
for some vector of parameters 0 = (01,...,0n), where 0; is the vector of parameters for 


the local distribution P(X; | pag(X;),6;). For shortness, further in this chapter we will 
write pa(X;) instead of pag(X;). In the case of discrete and completely observed data 
categorical distribution is commonly used. We note that in literature concerning learning 
Bayesian networks this type of distribution is often referred to as multinomial distribution 
or in some cases as unrestricted multinomial distribution (for example Heckerman (2021)) 
to differentiate this distribution from multinomial distributions that are low-dimensional 
functions of pa(X;). 

Hence we assume that each local distribution function is a collection of categorical 


distributions, one distribution for each configuration of its parents, namely 


P(X; = x! | paf, 0;) = ijk > 0; tor 1<k<ri,1 <j oe (3.1) 
whereq;= [| 1; and pał, paî,...,pař denote all possible configurations of pa(X;), 
XjEpa(Xi) 


and 0; = ((Oijk)k2)j-ı are the parameters. Note that the parameter 0;jı is given by 
the difference 1 — )>;"., ijk. For convenience, let us denote the vector of parameters 
Oi; = (ijz, Oij3; - - -, Fijr,) for all 1 < i < n and 1 < j < q; so that 0; = (Og) 

As it is well known, the maximum likelihood estimation (MLE) is a method of estima- 
ting the parameters of a probability distribution by maximizing the likelihood function, 
so that under the assumed statistical model the observed data is the most probable. Ba- 
sically, if C% is the result of a random test for an event C with several possible outcomes 
C1, Co,...,C, it will appear in the maximum likelihood for this event. Hence, the esti- 
mated value of Ĉ will be set as parameter 6 if it maximizes the value of the likelihood 
function P(C | 0). 
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For the general Bayesian network with n nodes we denote the likelihood function as 


L(@: D) = P(D | @) = IGOOR = [[ PE, Koll... X10) = 


l=1 


= [P [PO | pa 8) = [PTRO | pait, 89 = TT 2:0: : D) 


l=1 i=1 i=1 l=1 


(3.2) 


where by pa,[!] = pa(.X;)[/] we denote the l-th observation of the parents vector of the vari- 
able X;. This representation shows that the likelihood decomposes as a product of inde- 
pendent factors, one for each CPD in the network. This important property is called the 
global decomposition of the likelihood function. Moreover, this decomposition is an imme- 
diate consequence of the network structure and does not depend on any particular choice 
of the parameterization for CPDs (see Koller and Friedman (2009)). 

If the conditional distribution of X; given its parents pag(X;) is the categorical dist- 
ribution, then the local likelihood function can be further decomposed as follows 


m m Qi Ti 
L,(6;: D) = |] PCX() | pa;[0, 0) = [1 [ [PŒ = <} | pa,{] = pay, 0:) 
l=1 l=1 j=1 k=1 
qi Ti N( k i) (3.3) 
Tipa; 
= Bi: ’ 
j=1 k=1 


where N(x, pa’) is the number of cases in D for which X; = x* and pa(X;) = pa’. 

Considering that the dataset is complete for each possible value pa! of the parents 
pa(X;) of the node X;, the probability P(X; | pa’) is the independent categorical dist- 
ribution not related to any other configurations pa! of pa(X;) for j 4 l. Therefore, as 
the result of the MLE method we obtain the estimated parameter Ô as follows 


Â — Nig pai) 
SE N(pa!) 


where N (pa! ) denotes the number of cases when the configuration pa! appears in the full 
set of observations for the vector of variables pa(X;). 

Note that in general the MLE approach attempts to find the parameter vector 0 that 
is “the best” given the data C. On the other hand, the Bayesian approach does not 
attempt to find such a point estimate. Instead, the underlying principle is that we should 
keep track of our beliefs about values of 0, and use these beliefs for reaching conclusions. 
In other words, we should quantify the subjective probability we have initially assigned 
to different values of O taking into account new evidence. Note that in representing such 
subjective probabilities we now treat 0 as a random variable. Thus, the Bayesian approach 
is based on the Bayes rule 


_ P(C | 0)p(0) 


Hence, the basic idea of the Bayesian method for parameter learning is the following. 


(3.4) 


We treat @ as a random variable with a prior distribution p(@), and it is very common 
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to set p as the uniform distribution, especially in the case when we have no prior knowledge 
about 0. Given a distribution with unknown parameters and a complete set of observed 
data C, new beliefs about 0, namely p(@ | C), can be estimated according to the previous 
knowledge. The aim is to calculate p(@ | C) which is called the posterior probability of 
the parameter 0. For computational efficiency we want to use a conjugate prior, i.e. when 
the posterior distribution after conditioning on the data is in the same parametric family 
as the prior one. 


Here we assume that each vector @;; has the prior Dirichlet distribution, so that 


; T'(ai;) = Qijk—1 
p(8;;) = Dir(0;; | igi, ++) Qijri) = T; J Oai j (3.5) 
i A i wai (ijn) II i 
where aij; = yi Qijk, ijk > 0, k =1,...,7%, O71, ---, Qijr, are hyperparameters and T (-) 


is Gamma function. This is the standard conjugate prior to both categorical and multi- 
nomial distributions. Hence, the probability of observed samples is 


p(D) = [ree | 0;;)d0;; = 


-| r Tesi ‘x [Lio = (3.6) 


a 
_ _ Tay) "Ma + Ni) 
Play + Nis) TCO) 


where for shortness Nj; = N (zë, pal) and Nj; = N(pa?) = S77, Nig. The integral is 
(r; — 1)-dimensional over the set {6;;,>0, 2<k<1ri, Ylik <1} 

As we have already mentioned, in Bayesian method, if we do not have prior distri- 
bution we assume it to be uniform, which is consistent with the principle of maximum 
entropy in information theory, it maximizes the entropy of random variables with bounded 
support. Thus, if there is no information used for determination of prior distribution, we 
set hyperparameters a; =: = Qr = 1. 

Combining (3.4), (3.5) and (3.6) under the assumptions of parameter independence 
and complete data finally we obtain the posterior distribution as follows 


p(9;; | D) = Dir (0;; | Qij1 + Niji, TE Qijri + Nig) (3.7) 


Therefore, we have an estimate for each parameter 6;;, from data D as follows 


Continuous Variable Networks. When we were discussing the MLE method for disc- 
rete BNs, we mentioned the global decomposition rule which applies to any type of CPD. 
That is, if the data are complete, the learning problem reduces to a set of local lear- 
ning problems, one for each variable. The main difference is in applying the maximum 


likelihood estimation process to CPD of a different type: how we define the sufficient 
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statistics, and how we compute the maximum likelihood estimate from them. In this 
paragraph, we briefly discuss how MLE principles can be applied in the setting of linear 
Gaussian Bayesian networks. 

Consider a variable X with parents U = {U,,...,U;} with linear Gaussian CPD: 


p(X | u) = N (Bo + Biu +--+ + Bruk, 0°). 


Our task is to learn the parameters Oxiu = (Bo, B1,- - - , Bk, 07). To find the MLE values of 
these parameters, we need to differentiate the likelihood function and to solve the equa- 
tions that define a stationary point. As usual, it is easier to work with the log-likelihood 
function. Using the definition of the Gaussian distribution, we have that 


L(0 xu : D) = log Lx (@xju : D) = 


7 2 -3 log(2r0o?°) — alho + Brull] +--+ + Brull] — zii)? 


We consider the gradients of the log-likelihood with respect to all of the parameters 
Bo,..., 8, and o? and as a result we get a number of equations, which describe the solution 
to a system of linear equations. From the Theorem 7.3 in Koller and Friedman (2009), it 
follows that if B is a linear Gaussian Bayesian network, then it defines a joint distribution 
that is jointly Gaussian, and the MLE estimate has to match the constraints implied by it. 

Briefly speaking, to estimate p(X | U) we estimate the means of X and U and 
the covariance matrix of {X }UU from the data. The vector of means and the covariance 
matrix define the joint Gaussian distribution over {X} U U. Then, for example using 
the formulas provided by Theorem 7.3 in Koller and Friedman (2009), we find the unique 
linear Gaussian that matches the joint Gaussian with these parameters. 

The sufficient statistics we need to collect to estimate linear Gaussians are the uni- 
variate terms of the form 5), z[m] and >>, u;{m], and the interaction terms of the form 
Xm tlm] - uilm] and 57, u;[m]-u;[m]. From these we can estimate the mean and the co- 


variance matrix of the joint distribution. 


3.2 Inference in Bayesian networks 


In this section we assume that the network structure is known, meaning we know all 
the existing edges and their directions as well as all the CPDs. The problem of inference 
for BNs is a challenging task on its own and there is a lot of research done on the subject. 
We will not go into much of a detail on the inference since our focus is on learning their 
structure. However, the question of inference is worth mentioning here in order to get 
a wholesome picture of such a powerful tool as BNs. 

First, we discuss what the notion of inference means in the case of BNs. Typically it 


refers to: 


e marginal inference, i.e. finding the probability of a variable being in a certain state, 


given that other variables are set to certain values; or 
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e maximum a posteriori (MAP) inference, i.e. finding the values of a given set of 
variables that best explain (in the sense of the highest MAP probability) why a set 


of other variables have certain values, 


Let us demonstrate both categories of questions using an example. We will use the BN 
structure of a well-known ASIA network (see Figure 3.1) first introduced in Lauritzen 
and Spiegelhalter (1988). It illustrates the causal structure of a patient having a certain 
lung disease based on several factors, one being whether or not the patient has recently 
been to Asia. In this case, an exemplary question on marginal inference might be what 
is the probability of a patient who is a smoker and has dyspnoea having a certain lung 
disease, e.g. lung cancer. For the MAP inference, we might want to know what is the 
most likely set of conditions (with “smoking” and “dyspnoea” excluded) that could have 
caused the symptoms mentioned above. 

Now we provide short descriptions of the most popular exact and approximate infe- 
rence algorithms for BNs. Among them are variable elimination and belief propagation for 
the marginal inference, methods for the MAP inference and the sampling-based inference. 
For the purposes of transparency of the presentation the inference methods for BNs will 
be demonstrated for the discrete and finite case. 


tuberculosis? 


either tub. or lung cancer? 
positive X-ray? 


Figure 3.1: The ASIA Bayesian network structure 


3.2.1 Variable Elimination 


This inference algorithm is defined in terms of so-called factors and is developed to answer 
questions of marginal inference. Factors generalize the notion of CPDs. A factor @ is 
a function of value assignments of a set of random variables V with positive real values. 
The set of variables V is called the scope of the factor. There are two operations on 
factors that are repeatedly performed in a variable elimination algorithm (VE) and hence 


are of great importance. 
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e The factor product. If Vi, V2, and V3 are disjoint sets of variables and we have 
factors ¢,; and ə with scopes V; U Və and V> U V3 respectively, then we define 
the factor product 1 - 2 as a new factor w with the scope V; U V2 U V3 by 


(Vi, Va, V3) = o1(V1, V2) : b2(V2, V3). 


This product is the new factor over the union of the variables defined for each instan- 
tiation by multiplying the value of ¢, on the particular instantiation by the value 
of @2 on the corresponding instantiation. More precisely, 


(vi, V2, V3) = b1(U1, V2) : b2(V2, V3) 
for each instantiation, where vı € Val(Vi), v2 € Val(V2) and v3 € Val(V3). 


e The factor marginalization. This operation “locally” eliminates a set of variables 
from a factor. If we have a factor ¢(V1, V2) over two sets of variables V1, Vo, 


marginalizing Və produces a new factor 
7(V1) = X 4(V1, Və), 
V2 


where the sum is over all joint assignments for the set of variables Vj. More precisely, 


T(v1)= X`  b(v1, 02), 01 € Val(Vi) 


v2€ Val(V2) 
for each instantiation vı € Val(V}). 


Thus, in the context of factors we can write our distribution over all variables as a product 


of factors, where each factor presents a CPD as in (2.2): 
P(X1, Xo, Xn) = |] o:(Aa), (3.8) 
i=1 


where A; = (X;, pag(X;)) represents a set of variables including the i-th variable and its 
parents in the network. 

Now we can describe the full VE algorithm. Assume we want to find marginal dist- 
ribution of a fixed variable from X1,..., Xn. First we need to choose in which order O 
to eliminate remaining variables. The choice of an optimal elimination ordering O is 
an NP-hard problem and it may dramatically affect the running time of the variable 
elimination algorithm. Some intuitions and techniques on how to choose an adequate 
ordering are given for example in Koller and Friedman (2009). For each variable X; 
(ordered according to the ordering O) we perform the following steps: 


e multiply all factors containing X; (on the first round all the ¢; containing X;); 


e marginalize out X; according to the definition of the factor marginalization to obtain 
a new factor T (which does not necessarily correspond to a probability distribution, 
even though each ¢ is CPD); 
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e replace the factors used in the first step with 7. 


Essentially, we loop over the variables as ordered by O and eliminate them in this order. 
Performing those steps we use simple properties of product and summation on factors, 
namely, both operations are commutative and products are associative. The most im- 
portant rule is that we can exchange summation and product, meaning that if a set of 
variables X is not in the scope of the factor @,, then 


Xa -Q2 = 91° ` 2. (3.9) 
x x 


So far we saw that the VE algorithm can answer queries of the form P(V), where V 
is some subset of variables. However, in addition to this type of questions it can answer 


marginal queries of the form 


P(Y,E =e) 


es T 


where P(X, Y,E) is a probability distribution over sets of query variables Y, observed 
evidence variables E, and unobserved variables X. We can compute this probability by 
performing variable elimination once on P(Y,E = e) and then once again on P(E = e) 
taking into account only instantiations consistent with E = e. 

An exemplary run of the VE algorithm is presented in Table 3.1. It corresponds 
to Extended Student example first mentioned in Section 2.2. 


Step Variables Factors used Variables New 
eliminated involved factor 
1 C éc(C), dv(D, C) C,D 7(D) 
2 D éc(G, I, D), n (D) G,I,D T(G, I) 
3 I o1(L), ds(S, L), T2(G, I) G,S,I 73(G, S) 
4 H bu(H, G, J) H,G,J | t4(G,J) 
5 G 73(G,S), T4(G, J), br (L, G) | G, J, L, S | (J, L, S) 
6 S T(J, L, S), O71, L, S) J, L, S T6(J, L) 
T L TelJ, L) J, L T(J) 


Table 3.1: A run of variable elimination for the query P(J). 


3.2.2 Message Passing Algorithms 


Markov random fields. In the framework of probabilistic graphical models there exists 
another technique for compact representation and visualization of a probability distribu- 
tion which is formulated in the language of undirected graphs. This class of models (known 
as Markov Random Fields or MRFs) can succinctly represent independence assumptions 
that directed models cannot represent and the opposite is also true. There are advan- 
tages and drawbacks to both of those methods but that is not the focus of this thesis. 
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We will introduce and discuss MRFs only to the extent we need to properly describe and 
explain notions and methods concerning BNs. Note that the methods provided below for 
marginal and MAP inference are applicable both to MRFs and BNs. 


Definition 3.1. A Markov Random Field (MRF) is a probability distribution over vari- 
ables Xı,..., Xn defined by an undirected graph G in which nodes correspond to vari- 
ables X;. The probability has the form 


1 
P(X), Xo, ney Xn) = Z I] bel Xe); 
cEC 
where C denotes the set of cliques (i.e. fully connected subgraphs) of G and each factor ġe 


is a non-negative function over the variables in a clique. The partition function 


Z= X, [[ (x) 


(#1,.--,2n) ceC 


is a normalizing constant that ensures that the distribution sums to one, where the sum- 


mation is taken over all possible instantiations of all the variables. 


Thus, given a graph G, our probability distribution may contain factors whose scope 
is any clique in G and the clique can be a single node, an edge, a triangle, etc. Note that 
we do not need to specify a factor for each clique. 

It is not hard to see that Bayesian networks are a special case of MRFs with a norma- 
lizing constant equal to 1 where the clique factors correspond to CPDs. One can notice 
that if we take a directed graph G, add side edges to all parents of a given node and remove 
their directionality, then the CPDs (seen as factors over each variable and its ancestors) 
factorize over the resulting undirected graph. The resulting process is called moralization 
(see Figure 3.2). A Bayesian network can always be converted into an undirected network 


with normalizing constant 1. 


——_ 


O (-—) 


Figure 3.2: Moralization of a Bayesian network. 


(B) Moralization EF 


Message passing. As we mentioned above, the VE algorithm can answer marginal 
queries of the form P(Y | E = e). However, if we want to ask the model for another 
query, e.g. P(Y2 | Ey = e2), we need to restart the algorithm from scratch. Fortunately, in 
the process of computing marginals, VE algorithm produces many intermediate factors T 
as a side-product of the main computation, which turn out to be the same as the ones 


that we need to answer other marginal queries. 
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Many complicated inference problems can be solved by message-passing algorithms, in 
which simple messages are passed locally among simple elements of the system. An illust- 
rative example was shown in the book MacKay (2003) for a problem of counting soldiers. 
Consider a line of soldiers walking in the mist. The commander, which is in the line, 
wishes to count the soldiers. The straightforward calculation is impossible because of 
the mist. However, it can be done in a simple way which does not require any complex 
operations. The algorithm requires the soldiers’ ability to add two integer numbers and 
add 1 to it. The algorithm consists of the following steps (for example see Figure 3.3): 


e the front soldier in the line says the number ‘one’ to the soldier behind him, 
e the rearmost soldier in the line says the number ‘one’ to the soldier in front of him, 


e the soldier, which is told a number from the soldier ahead or the soldier behind, 
adds 1 to it and passes the new number to the next soldier in the line on the other 


side. 


1 2 3 4 
o-9-p-e~9 
J \ 4 wv S 3 i N 2 VN v N 


Commander 


Figure 3.3: A line of soldiers counting themselves using message-passing rule-set. 


Hence, the commander can find the global number of soldiers by simply adding to- 
gether the numbers: heard from the soldier in front of him, from the soldier behind him 
and 1. This method makes use of a property of the total number of soldiers: the number 
can be written as the sum of the number of soldiers in front of a point and the number 
behind that point, two quantities which can be computed separately, because the two 
groups are separated by the commander. When this requirement is satisfied this message- 
passing algorithm can be modified for a general graph with no cycles (as an example see 
Figure 3.4a). When the graph has no cycles (see Figure 3.4a) for each soldier we can 
uniquely separate the group into two groups, ‘those in front’, and ‘those behind’ and 
perform the algorithm above. However, it is not always possible for a graph with cycles, 
for instance for a soldier in a cycle (such as ‘Jim’) in Figure 3.4b such a separation is not 
unique. 

Using the same principle we will now describe the message passing for tree-structured 
networks (called belief propagation, BP for short) and then the modification of the method 


for general networks (called clique tree algorithm). 


Belief propagation. Let us first look at tree-structured graphs. Consider what happens 
if we run the VE algorithm on a tree in order to compute a marginal distribution P(X;). 


30 


On 3 KS 
A /5 F CI 
C vA ee a - EN AN PEE 
Ne AN pe Oo \ ©) # ji 
e < ? p Hp e ) Commander 
a Commander á 7 PTA 
ma P h o 
/ ee) x = < 
ej ? La S e a aA ; 
we Jim pa < Jim ie 


(a) No cycles (b) Contains a cycle. 


Figure 3.4: A swarm of soldiers. 


We can easily find the optimal ordering for this problem by rooting the tree at the node 
associated with X; and iterating through the nodes in post-order (from leaves to the root), 
just like for a swarm of soldiers with no cycles. At each step, we will eliminate one of 
the variables, say X,; this will involve computing the factor 7;(x,) = ye Gli 257s ee), 
where X;, is the parent of X; in the tree. At a later step, the variable X; will be eliminated 
in the same manner, i.e. 7),(x;,) will be passed up the tree to the parent X, of X; in order to 
be multiplied by the factor (x1, £k) before being marginalized out. As a result we obtain 
the new factor n(x). The factor 7;(x;) can be thought of as a message that X; sends 
to Xx that summarizes all of the information from the subtree rooted at the node X;. 
We can visualize this transfer of information using arrows on the tree, see Figure 3.5. 
At the end of the VE algorithm, the node X; receives messages from all of its children 
and the final marginal P(X;) is obtained by marginalizing those messages out. 


X3 
mı3(x3) A A| NA 115303) 
X1 
mz (x1) Y4 
m43(xX3) 


Figure 3.5: Message passing order when using VE to compute P(X3) on a small tree. 


With the same indices as above, suppose that after computing P(X;) we want to 
compute P(X;,) as well. We would again run VE for the new tree rooted at the node Xx, 


waiting until it receives all messages from its children. Note that the new tree consists 
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of two parts. The first one is the subtree rooted at X, with all its descendants from 
the original tree (i.e. rooted at X;). The other part is the subtree rooted at X, (which 
was the parent of X, in the original tree, but now is the child of X;,). Therefore, this 
part contains the node X;. The key insight is that the messages received by X, from X; 
now will be the same as those received when X; was the root. Thus, if we store the inter- 
mediary messages of the VE algorithm, we can quickly compute other marginals as well. 
Notice for example, that the messages sent to Xp from the subtree containing X; will need 
to be recomputed. So, how do we compute all the messages we need? Again, referring to 
the soldier counting problem, a node is ready to transmit a message to its parent after it 
has received all the messages from all of its children. All the messages will be sent out 
after precisely 2|E| steps, where |E| is the number of edges in the graph, since each edge 
can receive messages only twice. 

To define belief propagation (BP) algorithm formally let us see what kind of messages 
can be sent. For the purposes of marginal inference we will use suwm-product message 
passing. This algorithm is defined as follows: while there is a node X; ready to transmit 
to X; it sends the message 


mesa) = 3 aaen) [YP mota) 
Tk JENb(k)\{I} 
where Nb(k) \ {1} means all the neighbours of the k-th node, excluding l-th node. Note 
that this message is precisely the factor 7 that X; would transmit to X; during a round 
of variable elimination with the goal of computing P(X;), and also note that the product 
on the RHS of this equation naturally equals to 1 for leaves in the tree. 
After having computed all messages, we may answer marginal queries over any vari- 


able X; in constant time using the equation: 


P(X;) x Y(X) J| mle), 
leNb(j) 
where w(X;) is a product of all factors ¢ whose scope contains X;. In case of BNs we 
have the equality instead of proportionality. 


Clique Tree Algorithm. First let us define what is meant by a clique tree. Clique tree 
is an undirected tree such that its nodes are clusters C; of variables, meaning C; is a subset 
of a set of all variables {X1,...,X;,}. Each edge between clusters C; and C; is associated 
with a sepset (separation set) S;; = C; N Cj. See a simple example demonstrating a 
clique tree for a chain network in Figure 3.6. 

So far we assumed that the graph is a tree. What if that is not the case? Then the 
clique tree algorithm (also called the junction tree algorithm in the literature) can be 
used; it partitions the graph into clusters of variables so that interactions among clusters 
will have a tree structure, i.e. a cluster will be only directly influenced by its neighbours 
in the tree, we denote it 7. Then we can perform message passing on this tree. This leads 
to tractable global solutions if the local (cluster-level) problems can be solved exactly. 

In addition, clique trees must satisfy two following properties: 
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Figure 3.6: An example of a chain network consisting of three variables A, B and C; 
corresponding MRF and a clique tree with C; = {A, B}, C2 = {B,C} and S12 = {B}. 


1. family preservation, i.e. for each factor @ there is a cluster such that factor’s scope 
is a subset of the cluster; 


2. running intersection property (RIP), i.e. for each pair of clusters C;, C; and a vari- 
able X € C; MC; all clusters and sepsets on the unique path between C; and C; 
contain the variable X. 


Note that we may always find a trivial clique tree with one node containing all the vari- 
ables in the original graph, but obviously such trees are useless. Optimal trees are the ones 
that make the clusters as small and modular as possible; unfortunately, as in case of VE, 
the problem of finding the optimal tree is also WVP-hard. A special case when we can find 
it is when we originally have a tree, in this case we can put each connected pair of nodes 
into a separate cluster, it is easy to check that both conditions are met. One of the prac- 
tical ways to find a good clique tree is to use a simulation of VE, i.e. the elimination order 
fixed for VE will induce the graph from which we will take maximal cliques and set them 
as our clusters and form a tree. RIP will be satisfied automatically. Note that we do not 
need to run VE, just to simulate it for a chosen ordering and get the induced graph. More 
formally: 


Definition 3.2. Let ® be a set of factors (CPDs in the case of Bayesian Networks) 
over X = {Xj,...,Xn}, and < be an elimination ordering for some subset X C X. 
The induced graph denoted by I is an undirected graph over X , where X; and Xj 
are connected by an edge if they both appear in some intermediate factor w generated by 


the VE algorithm using < as an elimination ordering. 


In Figure 3.7 there is an example of an induced graph for the Student example using 
the elimination ordering of Table 3.1, cliques in that graph and a corresponding clique 
tree. One can see that RIP is satisfied, for a proof that trees corresponding to induced 
graphs by VE will satisfy RIP see Koller and Friedman (2009). 

Now let us define the full clique tree algorithm. First, we define the potential w;(C;) 
of each cluster C; as the product of all the factors ¢ in G that have been assigned to C;. 
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By the family preservation property, this is well-defined, and we may assume that our 
distribution is of the form 


P(X,...,Xn) = sees, 


Then, at each step of the algorithm, we choose a pair of adjacent clusters C;, C; in 
a tree graph 7 and compute a message whose scope is the sepset S;,; between the two 
clusters 

mi+;(Sij) = > wi(C;) I] mi-+i(Sz,). (3.10) 
Ci\Si,j lEND(i)\{5} 

In the context of clusters, Nb(i) denotes the set of indices of neighboring clusters of C}. 
We choose C; and C; only if C; has received messages from all of its neighbors except 
C;. Just as in belief propagation, this procedure will terminate in exactly 2|E;| steps 
because this process is equivalent to making an upward pass and a downward pass. In the 
upward pass, we first pick a root and send all messages towards it starting from leaves. 
When this process is complete, the root has all the messages. Therefore, it can now send 
the appropriate message to all of its children. This algorithm continues until the leaves 
of the tree are reached, at which point no more messages need to be sent. This second 
phase is called the downward pass. After it terminates, we will define the belief of each 


cluster based on all the messages that it receives 


blc) = lC) J| mlS. (3.11) 
) 


lENb(i 


These updates are often referred to as Shafer-Shenoy updates and the full procedure 
is also referred as sum-product belief propagation. Then each belief is the marginal of 
the clique 
B(C,) = Y P(X,..., Xn). 
X\Ci 

Now if we need to compute the marginal probability of a particular variable X we 
can select any clique whose scope contains X, and eliminate the redundant variables 
in the clique. A key point is that the result of this process does not depend on the clique 
we selected. That is, if X appears in two cliques, they must agree on its marginal. Two 
adjacent cliques C; and C; are said to be calibrated if 


` B(C;) = > B;(C;). 


Ci\Si,; C;\Si,5 


A clique tree 7 is calibrated if all pairs of adjacent cliques are calibrated. For a calibrated 
clique tree, we use the term clique beliefs for 3;(C;) and sepset beliefs for j1;,;(S;,;) defined 
as either side of the above equality. 

As the end result of sum-product belief propagation procedure we get a calibrated 
tree, which is more than simply a data structure that stores the results of probabilistic 
inference for all of the cliques in the tree, i.e. their beliefs (3.11). It can also be viewed 
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as an alternative representation of the joint measure over all variables. For sepset beliefs 
we have that 

Mig (Sig) = Min (Si) 18:3). 
Using this fact at convergence of the clique tree calibration algorithm, we get the unnor- 


malized joint measure P as 


PUG, 005 Xn) = I] wb,(C,) = i aon (3.12) 


where the product in the numerator is over all cliques and the product in the denominator 


is over all sepsets in the tree. As a result we get a different set of parameters that 
captures unnormalized measure that defined our distribution (in case of BNs it is simply 
the distribution) and there is no information lost in the process. Thus, we can view 
the clique tree as an alternative representation of the joint measure, one that directly 
reveals the clique marginals. 

The second approach, mathematically equivalent but using a different intuition, is 
message passing with division. In sum-product belief propagation messages were passed 
between two cliques only after one had received messages from all of its neighbors except 
the other one as in (3.10) and the resulting belief was (3.11). Nonetheless, a different 
approach to compute the same expression is to multiply in all of the messages, and then 
divide the resulting factor by the message from the other clique to avoid double-counting. 
To make this notion precise, we must define a factor-division operation. 


Let X and Y be disjoint sets of variables and let ¢,; and ə be two factors with scopes 


XUY and Y respectively. Then we define the division s as a factor-division ~ with 
2 
the scope X U Y as follows 


X,Y 
v(x, y) = A). 
2{Y) 
where we define 2 = 0. We now see that we can compute the expression of equation 


(3.10) by computing the beliefs as in equation (3.11) and then dividing by the remaining 


Zoas; Fi(Ci) 
mji(Sig) 


The belief of the j-th clique is updated by multiplying its previous belief by m,_,; and 


message 


mi+;(Si5) = 


dividing it by the previous message passed along this edge (regardless of the direction) 
stored in sepset belief j1;,; to avoid double counting. This algorithm is called belief update 


message passing and is also known as the Lauritzen-Spiegelhalter algorithm. 


3.2.3 MAP inference 


The maximum a posteriori (MAP) problem has a broad range of applications, in computer 
vision, computational biology, speech recognition, and more. By using MAP inference we 
lose the ability to measure our confidence (or uncertainty) in our conclusions. Never- 


theless, there are good reasons for using a single MAP assignment rather than using 
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Coherence Coherence 


(c) 


Figure 3.7: (a) Induced graph for VE in the Student example, using the elimination order 
of Table 3.1 (b) Cliques in the induced graph:{C, D}, {D,I,G}, {G, I, S}, {G, J, S, L} 
and {G, H, J}. (c) Clique tree for the induced graph. 


the marginal probabilities of the different variables. The first reason is the preference 
for obtaining a single coherent joint assignment, whereas a set of individual marginals 
may not make sense as a whole. The second is that there are inference methods that are 
applicable to the MAP problem and not to the task of computing probabilities, so that 
the former may be tractable even when the latter is not. The problem of finding the MAP 
assignment in the general case is MP—hard (Cooper (1990)). 

There are two types of Maximum a Posteriori (MAP) inference: a MAP query and 
a marginal MAP query. Assume first that the set of all variables X = Y UE consists of 
two disjoint sets, where E is the evidence meaning that we know values of those variables. 
Then a MAP query aims to find the most likely assignment to all of the non-evidence 
variables Y 

MAP(Y |E = e) = argmax P(Y = y | E = e). 
y 


Now assume that the set of all variables X = Y U W UE consists of three disjoint sets, 
where E is still the evidence. In this case a marginal MAP query aims to find the most 
likely assignment to the subset Y, marginalizing over the rest of the variables W 


MAP(Y |E = e) = argmax P(Y = y | E = e) = 
y 


= argmax ò P(Y=y,W=w|E=e). 
Y w 
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Both tasks can be solved within the same variable elimination (VE) and message passing 
frameworks as marginal inference, where instead of summation we use maximization. The 
second type of query is much more complicated both in theory and in practice since it 
involves both maximization and summation. In particular, exact inference methods such 
as VE can be intractable, even in simple networks. Hence, first we will briefly discuss 
them and then introduce some more efficient methods. 

Recall that while discussing VE we introduced two operations on factors, which were 
the foundation in performing the algorithm. Now we need to introduce one additional 
operation called the factor maximization. Let X be a set of variables, and Y ¢ X a variable 
not belonging to the set X. Let ¢(X,Y) be a factor over those variables. We define 
the factor maximization of Y in @ to be a factor Y over X such that: 


Y(X) = max (X,Y). 


More precisely, 


y(x) = eo olx, y) 


for each instantiation a € Val(X). Similarly to the property (3.9) we have that if a set 
of variables X is not in the scope of the factor ¢,, then 


max(¢1 : $2) = 1 Max Q2 (3.13) 


and 
max(ġı + %2) = dı + max do. (3.14) 


This leads us to a maz-product variable elimination algorithm for a general MAP query, 
which is constructed in the same way as a sum-product variable elimination algorithm in 
Subsection 3.2.1, but we replace the marginalizing step (summation) with maximization 
over corresponding variables. 

This way we find the maximum value for the joint probability, though the original 
and more interesting problem is to find the most probable assignment corresponding to 
that maximum probability. This process is called a traceback procedure, which is quite 
straightforward (details can be found in Koller and Friedman (2009)). In the process of 
eliminating variables we find their maximizing value given the values of the variables that 
have not yet been eliminated. When we pick the value of the final variable, we can then 
go back and pick the values of the remaining variables accordingly. 

Recall that the joint distribution P in Bayesian networks is represented by a product 
of factors, where each factor coincides with a CPD (we introduced this representation 
in (3.8)). Then we can write the marginal MAP query as 
argmax X P(y,W) = argmax ` I] Pi, 

Ww d W i 


yY 


where we skipped the evidence set for the transparency of notation since it does not effect 


the main point of discussion. First we compute 
max pp I] Qi. 
YW i 
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This form immediately suggests an algorithm combining the ideas of sum-product and 
max-product variable elimination. Specifically, the summations and maximizations out- 
side the product can be viewed as operations on factors. Thus, to compute the value of 
this expression, we simply have to eliminate the variables in W by summing them out, 
and the variables in Y by maximizing them out. When eliminating a variable X, whether 
by summation or by maximization, we simply multiply all the factors whose scope in- 
volves X, and then eliminate X to produce the resulting factor. The ability to perform 
this step is justified by the interchangeability of factor summation and maximization with 
factor product (properties (3.9) and (3.13)). The traceback procedure to find the most 
probable assignment can also be found in Koller and Friedman (2009). 

At first glance it seems that algorithms for both queries have the same complexity 
but that is not the case. It can be shown that even on very simple networks, elimination 
algorithms can require exponential time to solve a marginal MAP query (see Example 13.7 
in Koller and Friedman (2009)). The difficulty comes from the fact that we are not free 
to choose an arbitrary elimination ordering. When summing out variables, we can utilize 
the fact that the operations of summing out different variables commute. Thus, when 
performing summing-out operations for sum-product variable elimination, we could sum 
out the variables in any order. Similarly, we could use the same flexibility in the case of 
max-product elimination. Unfortunately, the max and sum operations do not commute. 
Thus, in order to maintain the correct semantics of marginal MAP queries, as specified 
in the equation, we must perform all the variable summations before we can perform any 
of the variable maximizations. 

We can also use the message passing framework, or more general case of clique tree 
algorithm, to MAP inference. In Subsection 3.2.2 we used clique trees to compute the sum- 
marginals over each of the cliques in the tree. Here, we compute a set of max-marginals 
over each of those cliques. By the max-marginal of a function f defined on the set X 
relative to a set of variables Y C X we denote such a factor that for each y € Y 

Max Marginals(y) = max f(a) 
(2)y=y 

determines the value of the unnormalized probability of the most likely joint assign- 
ment æ € X consistent with y. We compute the whole set for two reasons. First, the set of 
max-marginals can be a useful indicator for how confident we are in particular components 
of the MAP assignment. Second, in many cases, an exact solution to the MAP problem via 
a variable elimination procedure is intractable. In this case, to compute approximate max- 
marginals we can use message passing procedure in cluster graphs, similar to the clique 
tree procedure. These pseudo-max-marginals can be used for selecting an assignment; 
while this assignment is not generally the MAP assignment, we can nevertheless provide 
some guarantees in certain cases. As before, our task consists of two parts: computing 
the max-marginals and decoding them to extract a MAP assignment. 

As for the first part, in the same way as we modified sum-product VE to sum-product 
message-passing we modify max-product VE to max-product belief propagation algorithm 


in clique trees. The resulting algorithm executes precisely the same initialization and 
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overall message scheduling as in the sum-product belief propagation algorithm. The only 
difference is that we use max-product rather than sum-product message passing. As 
a result of running the algorithm we will get a set of max-marginals for every clique of 
our clique tree. 

Each belief is the max-marginal of the clique 8;(C;) = MazMarginal,(C;) and all 
pairs of adjacent cliques are maz-calibrated 

Hig (Sig) = Bie (Ci) = see B;(C;). 
Similarly to sum-product message passing we get reparameterization of the distribution 
in the form (3.12) with corresponding beliefs of the max-product belief propagation algo- 
rithm. 

Now we need to decode those max-marginals to get a MAP assignment. In the case 
of variable elimination, we had the max-marginal only for a single last to be eliminated 
variable and could identify the assignment for that particular variable. To compute the as- 
signments to the rest of the variables, we had to perform a traceback procedure. Now 
the situation appears different. One obvious solution is to use the max-marginal for each 
variable to compute its own optimal assignment, and thereby compose a full joint assign- 
ment to all variables. However, this simplistic approach works only in case when there is 
a unique MAP assignment, equivalently, each max-marginal has a unique maximal value. 
For generic probability measures this is not a very rigid constraint, thus, we can find the 
unique MAP assignment by locally optimizing the assignment to each variable separately. 

Otherwise, in most cases to break ties we can introduce a slight random perturbation 
into all of the factors, making all of the elements in the joint distribution have slightly dif- 
ferent probabilities. However, there might be cases when we need to preserve the structure 
in relationships between some variables, for example some variables can share parameters 
or there might be some deterministic structure that should be preserved. Under these 
circumstances we find a locally optimal assignment using for example traceback proce- 
dure. Afterwards we can verify if this assignment is a MAP assignment (for procedure 


and verification see Koller and Friedman (2009)). 


MAP as Linear Optimization Problem. In MAP inference we search for assign- 
ments which maximize a certain measure, in our case either the joint probability over all 
non-evidence variables or the probability over some set of variables. Therefore, it is na- 
tural to consider it directly as an optimization problem. There exists extensive literature 
on optimization algorithms and we can apply some of those ideas and algorithms to our 
specific case. 

The main idea here is to reduce our MAP problem to an Integer Linear Program- 
ming (ILP) problem, i.e. an optimization problem over a set of integer valued variables, 
where both the objective and the constraints are linear. First, to define ILP problem we 
need to turn the product representation of the joint probability as in (3.8) into a sum, 
replacing the probability with its logarithm. It is possible because all the factors (CPDs) 
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are positive. Hence, we want to compute 
argmax [[ 6:(A:) = argmax 5 og(#i(A,). 
E ii § Ge] 


where € is a general assignment for the whole vector of variables in the network, and 
A; = (Xi, pag(X;i)) represents a set of variables including the i-th variable and its parents 
in the network. Note that the whole discussion in this paragraph is actually identical for 
MRFs with positive factors, the only difference is the number of factors, but since they are 
not the focus of this thesis, we formulate everything in the Bayesian networks framework. 

For variable indices r € {1,...,n} we define the number of corresponding possible 
vector instantiations n, = | Val(A,.)|. For any joint assignment £, if this assignment con- 
strained to the variables from A, takes the value of af, j = {1,...,n,}, ie. ĉa, = af, 
then the factor log(¢,) makes a contribution to the objective of a quantity denoted as 
n; = log(¢,(a})). 

We introduce optimization variables q(x), where r enumerates the different factors, 
and 7 enumerates the different possible assignments to the variables from A,. These 
variables take binary values, so that q(a/) = 1 if and only if A, = a? and 0 otherwise. 
It is important to distinguish the optimization variables from the random variables in 
our original graphical model; here we have an optimization variable q(a#?) for each joint 
assignment af to the model variables A... 

Let q denote a vector of the optimization variables {q(a/), 1<r<n, 1<j<n,} 
and 7 denote a vector of the coefficients nf sorted in the same order. Both of these are 
vectors of dimension N = 5>"_, nr. With this interpretation, the MAP objective can be 
rewritten as: a 

maxX, > malai) (3.15) 
r=1 j=1 
or, in shorthand, max n! q. 

Now that we have an objective to maximize we need to add some consistency con- 
straints that would guarantee that an assignment q € {0,1} we get as a solution of 
optimization problem is legal, meaning it corresponds to some assignment in X. Namely, 
first we require that we restrict attention to integer solutions, then we construct two 
constraints to make sure that these integer solutions are consistent. The first constraint 
enforces the mutual exclusivity within a factor and the second one implies that factors 
in our network agree on the variables in the intersection of their scopes. In this way we 
reformulate the MAP task as an integer linear program, where we optimize the linear 
objective of equation (3.15) subject to discussed constraints. We note that the problem 
of solving integer linear programs is itself MP-hard, so that we do not avoid the basic 
hardness of the MAP problem. 

One of the methods often used to tackle ILP problems is the method of linear program 
relaxation. In this approach we turn a discrete, combinatorial optimization problem into 


a continuous problem. This problem is a linear program (LP), which can be solved in 
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polynomial time, and for which a range of very efficient algorithms exists. One can 
then use the solutions to this LP to obtain approximate solutions to the MAP problem. 
To perform this relaxation, we substitute the condition that the solutions are integer with 
a relaxed constraint that they are non-negative. 

This linear program is a relaxation of our original integer program, since every assign- 
ment to q that satisfies the constraints of the integer problem also satisfies the constraints 
of the linear program, but not the other way around. Thus, the optimal value of the ob- 
jective of the relaxed version will be no less than the value of the (same) objective in 
the exact version, and it can be greater when the optimal value is achieved at an assign- 
ment to q that does not correspond to a legal assignment €. An important special case 
are tree-structured graphs, in which the relaxation is guaranteed to always return integer 
solutions, which are in turn optimal (for proof and more detailed discussion see Koller 
and Friedman (2009)). Otherwise we get approximate solutions, which in order we need 
to transform into integer (and legal) assignments. 

One approach is a greedy assignment process, which assigns values to the variables X; 
one at a time. Another approach is to round the LP solution to its nearest integer value. 
This approach works surprisingly well in practice and has theoretical guarantees for some 
classes of ILPs (Koller and Friedman (2009)). 

An alternative method for the MAP problem which also comes from the optimization 
theory is called dual decomposition. Dual decomposition uses the principle that our prob- 
lem can be decomposed into sub-problems, together with linear constraints (the same as 
in ILP) that enforce some notion of agreement between solutions to the different prob- 
lems. The sub-problems are chosen such that they can be solved efficiently using exact 
combinatorial algorithms. The agreement constraints are incorporated using Lagrange 
multipliers, it is called Lagrangian relaxation, and an iterative algorithm - for example, 
a subgradient algorithm - is used to minimize the resulting dual. The initial work on 
dual decomposition in probabilistic graphical models was focused on the MAP problem 
for MRFs (see Komodakis et al. (2007)). 

By formulating our problem as a linear program or its dual, we obtain a very flexible 
framework for solving it; in particular, we also can easily incorporate additional con- 
straints into the LP, which reduce the space of possible assignments of q, eliminating 
some solutions that do not correspond to actual distributions over VY. The problems are 
convex and in principle they can be solved directly using standard techniques, but the size 
of the problems is very large, which makes this approach unfeasible in practice. However, 
the LP has special structure: when viewed as a matrix, the equality constraints in this 
LP all have a particular block structure that corresponds to the structure of adjacent 
clusters. Moreover, when the network is not densely connected, the constraint matrix is 
also sparse, thus, standard LP solvers may not be fully suited for exploiting this special 
structure. The theory of convex optimization provides a wide spectrum of tools, and 
some are already being adapted to take advantage of the structure of the MAP problem 
(see for example, Wainwright et al. (2005), Sontag and Jaakkola (2007)). The empirical 
evidence suggests that the more specialized solution methods for the MAP problems are 
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often more effective. 


Other methods. Another method for solving a MAP problem is local search algo- 
rithms. It is a heuristic-type solution, which starts with an arbitrary assignment and 
performs “moves” on the joint assignment that locally increase the probability. This tech- 
nique does not offer theoretical justification; however, we can often use prior knowledge 
to come up with highly effective moves. Therefore, in practice, local search may perform 
extremely well. 

There are also searching methods that are more systematic. They search the space 
so as to ensure that assignments that are not considered are not optimal, and thereby 
guarantee an optimal solution. Such methods generally search over the space of partial 
assignments, starting with the empty assignment and successively assigning variables one 
at a time. One such method is known as branch-and-bound. 

These methods have much greater applicability in the context of marginal MAP prob- 
lem, where most other methods are not currently applicable. In the next subsection 
we discuss sample-based algorithms which can be applied both to marginal and MAP 


inference. 


3.2.4 Sampling-based methods for inference 


In practice, the probabilistic models that we use can often be quite complex, and simple 
algorithms like VE may be too slow for them. In addition, many interesting classes of 
models may not have exact polynomial-time solutions at all, and for this reason, much re- 
search effort in machine learning is spent on developing algorithms that yield approximate 
solutions to the inference problem. In this subsection we consider some sampling methods 


that can be used to perform both marginal and MAP inference queries; additionally, they 


can compute various interesting quantities, such as the expectation E[f(X)] of a function 
of the random vector distributed according to a given probabilistic model. 

In general, sampling is rather a hard problem. The aim is to generate a random sample 
of the observations of X. However, our computers can only generate samples from very 
simple distributions, such as the uniform distribution over [0, 1]. All sampling techniques 
involve calling some kind of simple subroutine multiple times in a properly constructed 
way. For example, in case of multinomial distribution with parameters 6;,...,6, instead 
of directly sampling a multinomial variable we can sample a single uniform variable pre- 
viously subdividing a unit interval into k regions with region 7 having size 0;. Then we 


sample uniformly from [0,1] and return the value of the region in which our sample falls. 


Forward sampling. Now let us return to the case of Bayesian networks (BN). We can 
apply the same sampling technique to BNs with multinomial variables. We start from 
the nodes which do not have parents, these variables simply have multinomial distribution, 
and we go down the network to the next generation as arrows point out until we reach 


the leaves. Therefore, for a particular node we need to wait until all of its parents are 
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sampled. When we know all the values of parents the variable naturally has multinomial 
distribution. In the Student example to sample student’s grade, we would first sample 
an exam difficulty d’ and an intelligence level i’. Then, once we have samples d’ and i’, we 
generate a student grade g’ from P(g | d’,i’). There is one problem though, as we cannot 


perform it in case of having evidence for any variables besides roots. 


Monte Carlo and rejection sampling. Algorithms that construct solutions based on 
a large number of samples from a given distribution are referred to as Monte Carlo (MC) 
methods. Sampling from an arbitrary distribution p lets us compute integrals of the form 


Expl f(X)] = E f(@)p(2), 


where the summation extends over all possible values of X and p can be thought of as 
the density of X with respect to counting measure. Below we follow the same interpreta- 
tion also with regards to joint and conditional distributions. 

If f(X) does not have special structure that matches the BN structure of p, this 
integral will be impossible to compute analytically; instead, we will approximate it using 
a large number of samples from p. Using Monte Carlo technique we approximate a target 


expectation with 


T 
Expl O] S r= 5 Sle), 


T 


where x!,...,a7 are samples drawn according to p. It is easy to show that Ir is an un- 


biased estimator for Ex ,|f(X)| and its variance can be made arbitrarily small with 
a sufficiently large number of samples. 

Now let us consider rejection sampling as a special case of Monte Carlo integration. 
For example, suppose we have a Bayesian network over the set of variables X = ZU E. 
We may use rejection sampling to compute marginal probabilities P(E = e). We can 
rewrite the probability as 


P(E = e) = X_ P(Z=z,E=e)=) P(X =a)I(E=e) =Ex.,[I(E=e)| 


and then take the Monte Carlo approximation. In other words, we draw many samples 


from p and report the fraction of samples that are consistent with the value of the marginal. 


Importance sampling. Unfortunately, rejection sampling can be very wasteful. If 
P(E = e) equals, say, 1%, then we will discard 99% of all samples. A better way of 
computing such integrals uses importance sampling. The main idea is to sample from 
an auxiliary distribution q (hopefully with q(x) roughly proportional to f(a) - p(a)), 
and then reweigh the samples in a principled way, so that their sum still approximates 


the desired integral. 
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More formally, suppose we are interested in computing Ex.,|f(X)]. Adopting anal- 


ogous convention regarding notation for probability distribution we may rewrite this in- 


tegral as 
Sxl OXI] = Do Sewe) = Yo re) ae) = 
= Exn,|f(X)w(X)] ~ Treue) 
p(x) 


where w(x) = and the samples a’ are drawn from q. In other words, instead 


q(x) 


of sampling from p we may take samples from q and reweigh them with w(æ); the ex- 


pected value of this Monte Carlo approximation will be the original integral. By choosing 


ea 
Woe a 


the denominator is the quantity we are trying to estimate in the first place and sampling 


we can set the variance of the new estimator to zero. Note that 


from such q is NP-hard in general. 
In the context of our previous example for computing P(E = e), we may take q to be 


the uniform distribution and apply importance sampling as follows: 


—e)=E e|z) =E e | 2) 22)] - 
P(E = e) =E,Wp[p(e | z)] = Ezng K | Fel 


= Bag PEE] <8, loela] GD wela’), 


t=1 


ple, z) 


(z) 


is not too far from uniform, this will converge to the true probability after only a very 


where we(z) = . Unlike rejection sampling, this will use all the samples; if p(z | e) 


small number of samples. 


Markov chain Monte Carlo. Now let us turn to performing marginal and MAP 
inference using sampling. We will solve these problems using a very powerful technique 
called Markov chain Monte Carlo (MCMC). 

A key concept in MCMC is that of a Markov chain, which is a sequence of random 
elements having Markov property (see 2.3). A Markov chain ¥ = (Xo, X1, Xo2,...) with 
each random vector X; taking values from the same state space Val(4) is specified by 
the initial distribution P(Xo = a), x € Val(#), and the set of transition probabilities 


P(Xp44 = a’ | X; = x) 


for x, x’ € Val(#), which do not depend on k (in this case the Markov chain is called 
homogeneous). Therefore, the transition probabilities at any time in the entire process 
depend only on the given state and not on the history of the process. In what follows, 
we consider finite state space only so we may assume Val(¥) = {1,...,d}, unless stated 


otherwise. 
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If the initial state Xo is drawn from a vector of probabilities pọ, we may represent 
the probability p; of ending up in each state after t steps as 


pi = T*po, 


where T' denotes the transition probability matrix with Ty = P(Xk+ = i | X = J), 
i,j € {1,...,d}, and T* denotes matrix exponentiation. If the limit lim p; = 7 exists, it 
is called a stationary distribution of the Markov chain. A sufficient condition for 7 to be 
a stationary distribution is called detailed balance: 


mj) Tig = w(t) Ty 


for all i, j € Val(X). 

The high-level idea of MCMC is to construct a Markov chain whose states are joint 
assignments to the variables in the model and whose stationary distribution is equal to 
the model probability p. Then, running the chain for a number of times, we obtain 
the sample from the distribution p. In order to construct such a chain, we first recall 
the conditions under which stationary distributions exist. This turns out to be true under 
two sufficient conditions: irreducibility, meaning that it is possible to get from any state x 
to any other state x’ with positive probability in a finite number of steps, and aperiodicity, 
meaning that it is possible to return to any state at any time. In the context of continuous 
variables, the Markov chain must be ergodic, which is a slightly stronger condition than 
the above. For the sake of generality, we will require our Markov chains to be ergodic. 

At a high level, MCMC algorithms will have the following structure. They take as 
an argument a transition operator T specifying a Markov chain whose stationary distri- 
bution is p, and an initial assignment Xo = Xo of the chain. An MCMC algorithm then 
performs the following steps: 


1. Run the Markov chain from a9 for B burn-in steps. 
2. Run the Markov chain for N sampling steps and collect all the states that it visits. 


The aim of the burn-in phase is to wait until the state distribution is reasonably close to p. 
Therefore, we omit the first B states visited by the chain and then we collect a sample from 
the chain of the size N. A common approach to set the number B is to use a variety of 
heuristics to try to evaluate the extent to which a sample trajectory has “mixed”, i.e. when 
it is reasonably close to p (see Koller and Friedman (2009)). Also Geyer (2011) advocates 
that burn-in is unnecessary and uses other ways of finding good starting points. Gelman 
and Shirley (2012) propose to discard the first half of generated sequences. We may then 
use these samples for Monte Carlo integration (or in importance sampling). We may 
also use them to produce Monte Carlo estimates of marginal probabilities. Finally, we 
may take the sample with the highest probability and use it as an estimate of the mode 
(i.e. perform MAP inference). 

Before we discuss two most important special cases, note that sampling-based methods 


have theoretical asymptotic justification. Therefore, their application for finite samples of 
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reasonable size may lead to drastically inaccurate results, especially in sophisticated and 
complex models. Successful implementation heavily depends on how well we understand 
structure of the model as well as on intensive experimentation. It can also be achieved 


by combining sampling with other inference methods. 


Metropolis-Hastings Algorithm. The Metropolis-Hastings (MH) algorithm (Hast- 
ings (1970)) is one of the first ways to construct Markov chains within MCMC. The MH 


method constructs a transition operator T from two components: 


1. A transition kernel q specified by the user. In practice, the distribution q(x’ | æ) 


can take almost any form and very often it is a Gaussian distribution centered at x. 


2. An acceptance probability for moves proposed by q, specified by the algorithm as 


plæ)a(a' | 2)) | 


A(x | x) = min (i ana 


At each step, if the Markov chain is in the state 2, then we choose a new point a’ 
according to the distribution q. Then, we either accept this proposed change with the 
probability a = A(a’ | a), or with the probability 1 — œ we remain at our current state. 
Notice that the acceptance probability encourages the chain to move towards more likely 
points in the distribution (imagine for example that q is uniform); when q suggests that 
we move into a low-probability region, we follow that move only a certain fraction of time. 
Given any g the MH algorithm ensures that p is a stationary distribution of the resulting 
Markov Chain. More precisely, p will satisfy the detailed balance condition with respect 
to the Markov chain generated by MH algorithm. This is a straight consequence of 
the definition of A(x’ | æ). 

As the result we wish to build the Markov chain with a small correlation between subse- 
quent values, which allows to explore the support of the target distribution rather quickly. 
This correlation consists of two components. The higher the variance of q, the lower the 
correlation between the current state and the newly chosen one, and the lower the variance 
of q, the lower the correlation when we stay at the same state hitting the low-probability 
region. To choose a good kernel q we need to find good balance between the two. For 
multivariate distributions the covariance matrix for the proposal distribution should reflect 


the covariance structure of the target. 


Gibbs sampling. A widely-used special case of the Metropolis-Hastings methods is 
Gibbs sampling. It was first described in Geman and Geman (1984). Suppose we have 
a finite sequence of random variables X),...,X,. We denote the i-th sample as « = 


0) 


(xo, sad ey Starting with an arbitrary configuration 2 we perform the procedure 


below. 
Repeat until convergence for t = 1, 2,3,...: 


1. Set x ¢ x) 
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2. For each variable X; 


e Sample X! ~ P(X; | X) 
© Update x + (X1,...,X22,,X1, X85?,..., x0) 


e] a—1? w? 
3. Sete Ha 


By X_; we denote all the variables in our set except X;. At each epoch of the step 2 
only one site undergoes a possible change, so that successive samples for each iteration 
can differ in at most one coordinate. Note that at this step we use updated values of 
the variables for which we have already sampled new values. The sampling step is quite 
easy to perform because we only condition on variables from X;-th Markov blanket, which 
consists of its parents, children and other parents of its children. 

In Geman and Geman (1984) it was stated that the distribution of « converges 
to m as t + oo regardless of x. The only assumption is that we continue to visit each 
site which is obviously a necessary condition for convergence. As in case of any MCMC 
algorithm if we choose an arbitrary starting configuration there is a burn-in phase, for 
the list of intuitions on how to decide how many samples we want to discard see Casella 
and George (1992). To avoid the high correlation between successive samples in Gibbs 
sampler we can also take every r-th sample instead of all of them, which is rather a question 


of heuristics and experimenting. 


3.3 Learning probabilities in BNs for incomplete data 


Here we again consider categorical distributions. Suppose we observe a single incomp- 
lete case in our data, which we denote as d € D. Under the assumption of parameter 


independence, we can compute the posterior distribution of 8;; for our network as follows: 


p(9;; | d) = (1 — plpa? | d)){p(i;)} + X pla}, pad | d){p(Gi; | £}, pa’) }. 
k=1 
Each term in curly brackets in this equation is a Dirichlet distribution. Thus, unless 
both X; and all the variables in pa(X;) are observed in case d, the posterior distribution 
of 0i; will be a linear combination of Dirichlet distributions, that is a Dirichlet mixture 
with mixing coefficients (1 — p(pa! | d)) and p(z*, pa’ | d), 1 < k < rj. See Spiegelhalter 
and Lauritzen (1990) for the details of derivation. 

When we observe a second incomplete case, some or all of the Dirichlet components 
in the previous equation will again split into Dirichlet mixtures. More precisely, the pos- 
terior distribution for 0;; will become a mixture of Dirichlet mixtures. As we continue to 
observe incomplete cases, where each case has missing values for the same set of variables, 
the posterior distribution for @;; will contain a number of components that is exponential 
in the number of cases. In general, for any interesting set of local likelihoods and priors, 
the exact computation of the posterior distribution for @ will be intractable. Thus, we 


require an approximation for incomplete data. 


47 


One of the possible ways to approximate is Monte-Carlo methods discussed previously, 
for example the Gibbs sampler, which must be irreducible and each variable must be 
chosen infinitely often. More specifically for our case, to approximate p(@ | D) given 
an incomplete data set we start with some initial states of the unobserved variables in 
each case (chosen randomly or otherwise) and as a result, we have a complete random 
sample De. Then we choose some variable X;[{] (variable X; in case l) that is not observed 
in the original random sample D, and reassign its state according to the probability 


P(Xin, De \ {2a}) 
Xen P(E De \ {24} 


where De \ zu denotes the data set De with observation x;; removed, and the sum in the de- 


distribution 


P(xy | De \ {2u}) = 


nominator runs over all states of the variable X;. Both the numerator and denominator 
can be computed efficiently as in (3.6). In the third step we repeat this reassignment for 
all unobserved variables in D, producing a new complete random sample D/.. The fourth 
step is to compute the posterior density p(0;; | D) as in (3.7) and, under the assumption 
of parameter independence, the joint posterior p(@ | D!) will be a product of all densities 
p(@;; | Di). Finally, we iterate through last three steps, and use the average of p(@ | D!) 
as our approximation. 

Monte-Carlo methods yield accurate results but they are often intractable, for example 
when the sample size is large. Another approximation that is more efficient than Monte- 
Carlo methods and often accurate for relatively large samples is the Gaussian approxi- 
mation. The idea is that for large amounts of data we can approximate the distribution 
p(0 | D) x p(D | @)p(@) as a multivariate-Gaussian distribution, namely 


p(@ | D) ~ p(D | Öp) ex» ( -30 -BH -8V7 ) 


where @ is the configuration of @ that maximizes g(@) = In(p(D | @)p(@)) and H is a 
negative Hessian of g(@). The vector @ is also called the maximum a posteriori (MAP) 
configuration of 0. There are various methods to compute the second derivatives proposed 
in literature (Meng and Rubin (1991), Raftery (1995), Thiesson (1995)). 

One more way to learn probabilities from incomplete data is the Expectation-Ma- 
ximization (EM) algorithm. It is an iterative algorithm consisting of two alternating 
steps - Expectation and Maximization. When the data is incomplete we cannot calculate 
the likelihood function as in (3.2) and (3.3). Now instead of maximizing likelihood or log- 
likelihood function we will be maximizing the expected log-likelihood of the complete data 
set with respect to the joint distribution for X conditioned on the assigned configuration 
of the parameter vector 6’ and the known data D. The calculation of the expected log- 
likelihood (Expectation step) amounts to computing expected sufficient statistics. For 
incomplete data the expected log-likelihood takes the following form 


qi Ti 


E[C(8) | D,0'] =S XX Mix log(@ ilk) 


i=l I=1 k=1 
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Nix = E(X, = xf, pa(X;) = pai) | D, 6 = Pew = a}, pa(X;) = pa; | dj, 6’). 


Here d; is possibly incomplete j-th case in D. When X; and all the variables in pa(X;) 
are observed, the term for this case requires a trivial computation: it is either zero or 
one. Otherwise, we can use any Bayesian network inference algorithm discussed above to 
evaluate the term. 

Having performed the Expectation step we want to find the new parameter vector, 
which is obtained by maximization of the expected log-likelihood (Maximization step). 
In our case we have new parameters on the r-th iteration 

r Nth 

us i Ñ, ilk l 
We start algorithm with an arbitrary (for example, random) parameter configuration 0° 
and iteratively perform two steps described above until the convergence. Dempster et al. 
(1977) showed that, under certain regularity conditions, iterations of the expectation and 


maximization steps will converge to a local maximum. 


3.4 Learning parameters for CTBNs 


The new method we propose in next chapters for learning CTBNs is capable of performing 
both tasks of parameter learning and structure learning simultaneously, although naturally 
these tasks can be performed separately. In this section we review selected methods 


focused only on parameter learning. 


3.4.1 Data 


In this thesis we discuss both complete and incomplete data. In essence, CTBN models 
the joint trajectories of its variables, hence having complete, or fully observed, data means 
that for each point in time of each trajectory, we know the full instantiation to all variables. 

D = {o[l],...,0{/m]} we denote a data set of trajectories. In case of complete 
data each gfi] is a complete set of state transitions and the times at which they occurred. 
Another way to specify each trajectory is to assign a sequence of states x; € Val(X), each 
with an associated duration. 

In contrast to the definition of complete data, an incomplete data set can be repre- 
sented by a set of one or more partial trajectories. A partially observed trajectory o € D 
can be specified as a sequence of subsystems S; of X, each with an associated duration. 
A subsystem S describes the behaviour of the process over a subset of the full state space, 
i.e. Val(S) C Val(X). It is simply a nonempty subset of states of X, in which we know 
the system stayed for the duration of the observation. Some transitions are partially ob- 


served, i.e. we know only that they take us from one subsystem to another. Transitions 
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from one state to another within the subsystem are fully unobserved, hence, we do not 
know how many transitions there are inside of a particular subsystem nor when they do 


occur. 


3.4.2 Learning parameters for complete data 


Recall, that CTBN M consists of two parts. The first is an initial distribution PX, specified 
as a Bayesian network over X. The second is a continuous transition model, specified as 
a directed (and possibly cyclic) graph and a set of conditional intensity matrices (CIM), 
one for each variable X; in the network. For the purposes of this section we abbreviate 
pag(X;) as pa(.X;) and we denote CIMs as Qx;|pa(x;) Recall that each Qx;|pa(x;) consists 
of intensity matrices Qx,\pa,, where pa, is a single configuration of pa(X;). Strictly 
speaking, pa, is one of the possible parent configurations paj,..., pa? similar to (3.1). 
In terms of pure intensity parameterization we denote elements of these matrices as qz2"|pa, 
and qz\pa,. Note, that by Theorem 2.9 we can divide the set of parameters in terms of 
mixed intensity into two sets. Then for each variable X; and each instantiation pa, of its 
set of parents pa(X;) the parameters of Qy,\pa(x,) Will be qx, = {dz\pa, : £ E€ Val(X;)} 
and Ox, = {Or2/\pa, : 2,2’ € Val(X;), x # x'}. More precisely, for each X; and every 
x € Val(X;) we have 

Wea! |pa; 1 1 

Oca ee dew xv E Val( Xi), TET. 
The learning problem for the initial distribution is a Bayesian network learning task, 
which was discussed previously in this chapter. Therefore, it remains to learn the vector 


of parameters (q, 0). 


Likelihood estimation. Let us start from a fully observed case and a single homo- 
geneous Markov process X(t). As all the transitions are observed, the likelihood of D 
can be decomposed as a product of the likelihoods for individual transitions d. Let 
d = (z4, ta, £4) € D be the transition where X transitions to state x} after spending 
the amount of time tg in state xg. Using the mixed intensity parameterization, we can 


write the likelihood for the single transition d as 
Lx(q,9: d) =Lx(q: d)Lx(@: d) = qr, exP(—Geyta) ` Orari 


Then multiplying the likelihoods for each transition d in our data D we can summarize 
it in terms of sufficient statistics T[x] which describes the amount of time spent in each 
state x € Val(X) and M[z,2'] which encodes the number of transitions from x to 2’, 


where x # x’ as follows: 


Lx(q,0:D) = 0 Lx(q: D) (u Ly(0: D) 


dED dED 


= (11 git epal) (1 I] age) 


x x alta 


(3.16) 
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where M[z] = >> M[z, 2’). 

Now in case of CTBNs, each variable X of the network M is conditioned on its par- 
ent set Pa = pag(X), and each transition of X must be considered in the context of 
the instantiation pa of Pa. With complete data, we know the value of Pa during the en- 
tire trajectory, so at each point in time we know precisely which homogeneous intensity 
matrix Qx|pa governed the dynamics of X. 

Thus, the likelihood decomposes into the product of likelihoods, each corresponding 


to the variable in the network, as 


Ly(q,9:D) =|] Lx;(@x,\u;> 9x,\u; : D) =|] Lx;(@x,u, : P)Lx,(8x,\u; : D). 
XiEX X;EX 
The term Lx (@x\pa : D) is the probability of the sequence of state transitions, disregarding 
the times between transitions. These state changes depend only on the value of the parents 
at the moment of the transition. For each variable X € X let M|z,z2' | pa] denote 
the number of transitions from X = x to X = 2’ while Pa = pa. Then, with this set of 
sufficient statistics M|zx,x' | pa], we have 


Lx(Oxwa:D) = JI JI JI ina 


pa x a/Azxr 


The computation of Lx (qx\pq : D) is more subtle since the duration in the state can be 
terminated not only due to a transition of X, but also due to a transition of one of its 
parents. The total amount of time where X = x and Pa = pa can be decomposed into 
two different kinds of durations T[x | pa] = T,[x | pa] + T.[x | pa], where T,[x | pa] is 
the total length of the time intervals that terminate with X remaining equal to x, and 
T..|x | pa] is the total length of the time intervals that terminate with a change in the value 
of X. However, it is easy to show that we do not need to maintain the distinction between 
the two of them and we can use the set of T'|x | pa] as sufficient statistics. 

Finally, we can write the log-likelihood as a sum of local variable likelihoods of the form 


lx(q,0 : D) = x(q ; D) +lx(0 é D) = 


+Y X Me, x’ | pa) log 622\pa 


pa x g/fx~a 


=| 5 Mi T | pal log da|pa — e|\pal” [x | pal 


pa x 


(3.17) 


Now we can write the maximum-likelihood (MLE) parameters as functions of the sufficient 
statistics as follows (for the proof see Nodelman (2007)): 


dz|pa = 


’ zx'|pa — 


M[zx | pal A M{z, x | pa] 
T[z | pal M{x | pal ` 


The Bayesian approach. The other way to estimate parameters in case of fully ob- 
served data is the Bayesian approach. To perform Bayesian parameter estimation, simi- 


larly to the case of Bayesian networks, for computational efficiency we use a conjugate 
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prior (one where the posterior after conditioning on the data is in the same parametric 
family as the prior) over the parameters of our CTBN. 

For a single Markov process we have two types of parameters, a vector of parameters 0 
for categorical distribution and q for exponential distribution. An appropriate conjugate 
prior for the exponential parameter q is the Gamma distribution P(q) = Gamma(a, T), 
and as we mentioned in Section 3.1, the standard conjugate prior to categorical distri- 
bution is a Dirichlet distribution P(@) = Dir(Qz2,,...,Qzx,). The posterior distribu- 
tions P(@ | D) and P(q | D) given data are Dirichlet and Gamma distributions, respec- 
tively. 

In order to apply this idea to an entire CTBN we need to make two standard assump- 


tions for parameter priors in Bayesian networks, global parameter independence: 


P(q,0) = ] | P(4xjpag(x): 9 x1pag(x)) 


XEX 


and local parameter independence for each variable X in the network: 


P(qx|Pa, 9x\Pa) = TIT") (TIT. 


x pa x pa 


If our parameter prior satisfies these assumptions, so does our posterior, as it belongs 
to the same parametric family. Thus, we can maintain our parameter distribution in 
the closed form, and update it using the obvious sufficient statistics M |x, x’ | pa] for Ozipa 
and M|z | pa], T[x | pa] for qzjpa. 

Given a parameter distribution, we can use it to predict the next event, averaging out 
the event probability over the possible values of the parameters. As usual, this prediction 
is equivalent to using “expected” parameter values, which have the same form as the MLE 


parameters, but account for the “imaginary counts” of the hyperparameters: 


Qzjpa + M[z | pal A _ Qa!jpa + M|z, 2' | pal 
Tz|pa + T|z | pal Qz|pa + M|x | pal 


qz|pa = 


’ Oxs!\pa = 


Note that, in principle, this choice of parameters is only valid for predicting a single 
transition, after which we should update our parameter distribution accordingly. However, 
as is often done in other settings, we can approximate the exact Bayesian computation by 
“freezing” the parameters to these expected values, and use them for predicting an entire 


trajectory. 


3.4.3 Learning parameters for incomplete data 


Recall, that in case of Bayesian networks one of the methods to deal with missing data 
was Expectation-Maximization (EM) algorithm. Here we provide a concise description 
of the algorithm based on EM for CTBNs presented in detail in Nodelman et al. (2012). 
We start again with reviewing the EM scheme for a single Markov process X, which is 
the basis of the algorithm for CTBNs. Let D = {o[1],...,a[m]} denote the set of all 
partially observed trajectories of X. 
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For each partial trajectory oļi] € D we can consider the space H[i] of possible comple- 
tions of this trajectory. For every transition of gfi] each completion h[i] € H{i] specifies 
which underlying transition of X occurred. Also it specifies all the entirely unobserved 
transitions of X. Combining o[i] and hļi] gives us a complete trajectory o*[i] over X. 
Note that, in a partially observed trajectory, the number of possible unobserved transi- 
tions is unknown. Moreover, there are uncountably many times at which each transition 
can take place. Nevertheless, we can define the set Dt = {o7[1],...,07[m]} of comple- 
tions of all of the partial trajectories in D. For examples of completions see Nodelman 
et al. (2012). 

As we mentioned in the previous subsection, the sufficient statistics of the set of 
complete trajectories Dt for a Markov process are T(x], the total amount of time that X 
stays in x, and M|z, x'], the number of times in which X transitions from x to x’. Applying 
logarithm to (3.16) we can write the log-likelihood x(q, 0 : DT) for X as an expression 
of these sufficient statistics. 

Let r be a probability density over each completion in Hfi] which, in turn, yields 
a density over possible completions of the data Dt. We can write the expectations of 
the sufficient statistics with respect to the probability density over possible completions 
of the data as T[z], M[x,x'] and M[z]. These expected sufficient statistics allow us to 
write the expected log-likelihood for X as 


E, [€x(q,0 : D*)] = E,[éx(q : D*)] + Ex[éx(6 : D*)] = 


is (mu ln(qs) — qT [z] + SS M{z, 2'| 0x) 


g Fr 


x 


Now we can use the EM algorithm to find maximum-likelihood parameters q,0 of X. 
The EM algorithm begins with an arbitrary initial parameter assignment, q?, 0°. It then 
repeats the two steps, Expectation and Maximization, updating the parameter set, until 
convergence. After the k-th iteration we start with parameters q*, 0". The Expectation 
step goes as following: using the current set of parameters, we define for each ofi] € D, 
the probability density r*(h[i]) = p(h[i] | ali], q, 0"). We then compute expected suffi- 
cient statistics T[z], M[a, x’] and M|z] according to this posterior density over completions 
of the data given the data and the model. Using the expected sufficient statistics we just 
have computed as if they came from a complete data set, we set q*+! and 6**' to be 
the new maximum likelihood parameters for our model as follows 
k+1 _ M[z] g+ M{z, 2'] 


z Ts)’ se = Mle (3.18) 


The difficult part in this algorithm is the Expectation Step. The space over which we are 
integrating is highly complex, and it is not clear how to compute the expected sufficient 
statistics in a tractable way. 

In Nodelman et al. (2012) and Nodelman (2007) authors provided in detail the algo- 
rithm on how to compute expected sufficient statistics for an n-state homogeneous Markov 


process X, with intensity matrix Qx with respect to the posterior probability density over 
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completions of the data given the observations and the current model. The statistics are 
computed for each partially observed trajectory o € D separately and then the results 
are combined. 

A partially observed trajectory ø is given as a sequence of N subsystems so that 
the state is restricted to subsystem S; during the interval fti, ti+1) fr 0 <i < N—-1. 
To conduct all the necessary computations, for each time t, the forward and backward 
probability vectors a; and 6; are defined, which include evidence of any transition at time t, 


and also vectors a; and 87, neither of which include evidence of a transition at time t. 


~ 


The total expected time E[T[j]] is obtained by summing the integrals over all intervals 


of constant evidence [v, w) with the subsystem S to which the state is restricted on that 
interval. Each integrand is an expression containing a,, By and Qs. The computations for 
each integral are performed via the Runge-Kutta method of fourth order with an adaptive 


step size. 


Regarding the expected number of transitions E[M|z, x']] from the state x to x’ dis- 


crete time approximations of M|,2’] are considered which in the limit as the size of 
the discretization goes to zero yields an exact equation. As a result we get the sum of 
expressions where each summand is associated with a time interval. The overall expres- 
sion for the expected number of transitions consists of two parts: the sum of products 
corresponding to intervals with partially observed transitions and containing a; and 6; 
for different time points t and the sum of integrals of practically identical form to those 
obtained for total expected time. 

In order to compute a; and 3; a forward-backward style algorithm (Rabiner and Juang 
(1986)) over the entire trajectory is used to incorporate evidence and get distributions over 
the state of the system at every time t;. If needed it is possible to exclude incorporation of 
the evidence of the transition from either forward or backward vector and also obtain a; 
and 8. We can then write the distribution over the state of the system at time t given 
all the evidence. 

Continuous time Bayesian networks are a factored representation for homogeneous 
Markov processes, hence, extending the EM algorithm to them involves making it sensitive 
to a factored state space. As mentioned previously, the log-likelihood decomposes as 
the sum of local log-likelihoods for each variable. With the sufficient statistics Tx | pal, 
M[a,x’ | pa] and M{z | pa] of the set of complete trajectories D* for each variable X in 
CTBN WN the likelihood for each variable X further decomposes as in (3.17). By linearity 


of expectation, the expected log-likelihood function also decomposes in the same way. 


So we can write the expected log-likelihood E,[{¢(q,@ : D*)] as a sum of terms, one for 


each variable X, in a similar form as (3.17), but using the expected sufficient statistics 
T\x | pa], M[z, 2’ | pa] and M[z | pal. 

The EM algorithm for CTBNs is essentially the same as for homogeneous Markov 
processes. We need only specify how evidence in the network induces evidence on the in- 
duced Markov process, and how expected sufficient statistics in the Markov process give 
us the necessary sufficient statistics for CTBN. 

The Maximization step is practically the same as in (3.18), we just use proper expected 
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sufficient statistics for the CTBN case: 


k41 _ Mz | pal Rit aa M|z, 2’ | pal 


“pa Tle |pal’ =a Mle | pal 


The Expectation step is again more difficult and could be done by flattening the CTBN 
into a single homogeneous Markov process with a size of the state space exponential 
in the number of variables. Then we follow the method described above. However, as 
the number of variables in the CTBN grows the process becomes intractable, so we are 
forced to use approximate inference. 

We want this approximate algorithm to be able to compute approximate versions of 
the forward and backward messages a; and 8, and extract the relevant sufficient statistics 
from these messages efficiently. In the next subsection we review a cluster graph infe- 
rence algorithm which can be used to perform this type of approximate inference. Using 


obtained cluster beliefs (see below) we can compute a;,,, and 6, and use them in the 


i+1 
forward-backward message passing procedure. The cluster distributions are represented 
as local intensity matrices from which we can compute the expected sufficient statistics 


over families X;, pag(X;) as described above. 


3.5 Inference for CTBNs 


To gain the perspective on the whole concept of continuous time Bayesian networks and 
their power, similarly to Bayesian networks, we discuss the questions of inference although 
it is not the key subject of this thesis. We start with a discussion of the types of queries 
we might wish to answer and the difficulties of the exact inference. 


Inference for CTBNs can take a number of forms. The common types of queries are: 


e querying the marginal distribution of a variable at a particular time or also the time 


at which a variable first takes a particular value, 


e querying the expected number of transitions for a variable during a fixed time in- 


terval, 


e querying the expected amount of time a variable stayed in a particular state during 


an interval. 


Previously we showed that we can view CTBN as a compact representation of a joint 
intensity matrix for a homogeneous Markov process. Thus, at least in principle, we can 
use CTBN to answer any query that we can answer using an explicit representation of 
a Markov process: we can form the joint intensity matrix and then answer queries just as 
we would do for any homogeneous Markov process. 

The obvious flaw is that this approach for answering these queries requires us to 
generate the full joint intensity matrix for the system as a whole. The size of the matrix 


is exponential in the number of variables, making this approach generally intractable. The 
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graphical structure of the CTBN immediately suggests that we perform the inference in 
a decomposed way, as in Bayesian networks. Unfortunately, the problems are significantly 
more complex in this setting. 

In Nodelman et al. (2002) the authors describe an approximate inference algorithm 
based on ideas from clique tree inference, but without any formal justification for the algo- 
rithm. More importantly, the algorithm covers only point evidence, meaning observations 
of the value of a variable at a point in time, but in many applications we observe a variable 
for an interval or even for its entire trajectory. Therefore, we shortly describe an approx- 
imate inference algorithm called Expectation Propagation (EP) presented in Nodelman 
et al. (2005) that allows both point and interval evidence. The algorithm uses message 
passing in a cluster graph (with clique tree algorithms as a special case), where the clus- 
ters do not contain distributions over the cluster variables at individual time points, but 
over trajectories of the variables through a duration. 

As we discussed in this chapter, in cluster graph algorithms we construct a graph 
whose nodes correspond to clusters of variables and then pass messages between these 
clusters to produce an alternative parameterization, in which the marginal distribution of 
the variables in each cluster can be read directly from the cluster. In discrete graphical 
models, when the cluster graph is a clique tree, two passes of the message passing algorithm 
produce the exact marginals. In generalized belief propagation (Yedidia et al. (2001)), 
message passing is applied to a graph which is not a clique tree, in which case the algorithm 
may not converge, and produces only approximate solutions. There are several forms of 
message passing algorithm as we have discussed in Subsection 3.2.2. The algorithm of 
Nodelman et al. (2005) is based on multiply-marginalize-divide scheme of Lauritzen and 
Spiegelhalter (1988), which we now briefly review. 

A cluster graph is defined in terms of a set of clusters C;, whose scope is some subset 
of the variables X. Clusters are connected to each other by edges, along which messages 
are passed. The edges are annotated with a set of variables called a sepset S; j, which 
is the set of variables in C; N C;. The messages passed over an edge between C; and Cj 
are factors over the scope 5;,;. Each cluster C; maintains a potential 6;, which is a factor 
reflecting its current beliefs over the variables in its scope. Each edge similarly maintains 
a message /1;,; Which encodes the last message sent over the edge. The potentials are 
initialized with a product of some subset of factors parameterizing the model (CIMs in 
our setting). Messages are initialized to be uninformative. Clusters then send messages to 
each other, and use incoming messages to update their beliefs over the variables in their 
scope. The message m,;_,; from C; to C; is the marginal distribution S; ; according to bi. 
The neighbouring cluster C; assimilates this message by multiplying it into 6;, but avoids 
double-counting by first dividing by the stored message u; j. Thus, the message update 
takes the form 8; <— 6; - oe. 

In the algorithm the cluster beliefs represent not the factors over values of random 
variables themselves, but rather cluster potentials and messages both encode measures 
over entire trajectories of the variables in their scope. The number of parameters grows 


exponentially with the size of the network, and thus we cannot pass messages exactly 
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without giving up the computational efficiency of the algorithm. To address this issue 
Nodelman et al. (2005) used the expectation propagation (EP) approach of Minka (2001), 
which performs approximate message passing in cluster graphs. In order to get an approx- 
imate message each message Mi; is projected into a compactly representable space so as 
to minimize the KL-divergence between the message and its approximation. To encode 
the cluster potentials CIMs are used. In order to apply the EP algorithm to clusters of 
this form some basic operations over CIMs need to be defined. They include CIM product 
and division, approximate CIM marginalization, as well as incorporating the evidence into 
CIM. 

The message propagation algorithm is first considered for one segment of the trajectory 
with constant continuous evidence. Exactly the same as for Bayesian networks, this 
process starts with constructing the cluster tree for the graph G. Note that cycles do not 
introduce new issues. We can simply moralize the graph connecting all parents of a node 
with undirected edges and then make all the remaining edges undirected. If there is 
a cycle, it simply turns into a loop in the resulting undirected graph. Next we select a set 
of clusters C;. These clusters can be selected so as to produce a clique tree for the graph, 
using any standard method for constructing such trees. We can also construct a loopy 
cluster graph and use generalized belief propagation. We did not discuss this topic in 
the thesis (for more details see Koller and Friedman (2009)). The message passing scheme 
described in this section is the same in both cases. 

The algorithm iteratively selects an edge connecting the clusters C; and C; in the clus- 
ter graph and passes the message from the former to the latter. In clique tree propagation 
the order in which we chose edges was basically fixed, meaning that we started from leaves 
to roots performing an upward pass and then going in the opposite direction. In genera- 
lized belief propagation, we might use a variety of message passing schemes. Convergence 
occurs when messages cease to affect the potentials which means that neighboring clus- 
ters C; and C; agree on the approximate marginals over the variables from Sij. 

Now we can generalize the algorithm for a single segment to trajectories containing 
multiple segments of continuous evidence. Nodelman et al. (2005) applied this algorithm 
separately to every segment, passing information from one segment to the next one in 
the form of distributions. More precisely, consider a trajectory defining a sequence of 
time points t),...,tn, with constant continuous evidence on every interval [t;, t;}1) and 
possible point evidence or observed transition at each t;. Then a sequence of cluster 
graphs over each segment is constructed. Starting from the initial segment EP inference 
is run on each cluster graph using the algorithm for a single segment described above, 
and the distribution at the end time point of the interval is computed. The resulting 
distribution is then conditioned on any point evidence or the observed transition, and 
next used as the initial distribution for the next interval. 

However, there is one subtle difficulty relating to the propagation of messages from 
one interval to another. If a variable X appears in two clusters C; and C; in a cluster 
graph, the distribution over its values in these two clusters is not generally the same, even 


if the EP computation converges. The reason is that even calibrated clusters only agree 
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on the projected marginals over their sepset, not the true marginals. To address this issue 
and to obtain a coherent distribution which can be transmitted to the next cluster graph 
the individual cluster marginals and sepsets for the state variables at the end time point 
of the previous interval are recalibrated to form a coherent distribution (the conditioning 
on point evidence can be done at the same time if needed). Then we can extract the new 
distribution as a set of calibrated cluster and sepset factors, and introduce each factor 
into the appropriate cluster or sepset in the cluster graph for the next time interval. 
The above algorithm performs the propagation of beliefs forward in time. It is also pos- 
sible to do a similar propagation backwards and pass messages in reverse, where the cluster 
graph for one time interval passes a message to the cluster graph for the previous one. 
Also to achieve more accurate beliefs we can repeat the forward-backward propagation 
until the entire network is calibrated, essentially treating the entire network as a single 
cluster graph. Note that since one cluster graph is used for each segment of fixed con- 
tinuous evidence, then each cluster will approximate the trajectory of all the variables it 
contains as a homogeneous Markov process for the duration of the entire segment. There- 
fore, the choice of segments and the resulting subsets of variables, over which we compute 


the distribution, determine the quality of the approximation. 
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Chapter 4 


Structure learning for Bayesian 
networks 


Recall the Definition 2.3 of Bayesian Networks (BN), the notion of which combines 
the structure given by a Directed Acyclic Graph (DAG) and the probability distribu- 
tion encoded by Conditional Probability Distributions (CPD). By far, in Chapter 3 we 
discussed the problem of finding CPDs and making the inference given the structure. In 
this chapter we will discuss the problem of learning the structure of Bayesian networks. 
In Section 4.1 we briefly review known approaches to the problem. In Section 4.2 we 
recall partition MCMC algorithm for learning the structure of the network, whose part 
concerning the division of the graph into layers will be the first step of our new method. 
In Sections 4.3 and 4.4 we present a novel approach to structure learning with the use of 
the above algorithm and LASSO approach for continuous and discrete data, respectively. 


Section 4.5 is dedicated to numerical results. 


4.1 Problem of learning structure of Bayesian Networks 


Structure learning is known to be a hard problem, especially due to the superexponential 
growth of the DAG space when the number of nodes is increasing. Generally speaking 
the literature on the structure learning can be divided into three classes: constraint- 
based methods, score-and-search algorithms and the dynamic programming approach (as 
discussed for example in Koller and Friedman (2009)), even though this division is not 
that strict. The contents of this section come mostly from Kuipers and Moffa (2017) and 
Daly et al. (2011). 

Constraint-based methods use conditional independence tests to obtain information 
about the underlying causal structure. They start from the full undirected graph and 
then make decisions about removing the edge in the network based on tests of conditional 
independence. The widely used algorithm of this nature, PC algorithm (Spirtes et al. 
(2000) ), and constraint-based methods in general are sensitive to the order in which they 
are run. However Colombo and Maathuis (2014) proposed some modifications for PC 
algorithm to remove either partially or altogether this dependence. These methods scale 
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well with the dimension but are sensitive to local errors of the independence tests which 
are used. 

One of the most widely studied ways of learning a Bayesian network structure has 
been the use of so-called ’score-and-search’ techniques. These algorithms comprise of: 


e a search space consisting of the various allowable states, each of which represents 


a Bayesian network structure; 
e a mechanism to encode each of the states; 
e a mechanism to move from state to state in the search space; 


e ascoring function assigning some score to a state in the search space which describes 


the goodness of fit with the sample data. 


Also some hybrid methods combining ideas from both techniques were proposed, for 
example the max-min-hill-climbing of Tsamardinos et al. (2006). 

Within the family of search and score methods we can distinguish a separate class 
of MCMC methods for the graph space exploration. Their main and huge advantage is 
that they can provide a collection of samples from the posterior distribution of the graph 
given the data. This means that rather than making the inference based on a single 
graphical model, we can account for model uncertainty by averaging over all the models 
in the obtained class. In particular, we can estimate the expectation of any given network 
feature, such as the posterior probability of an individual edge, by averaging the posterior 
distributions under each of the models, weighted by their posterior model probabilities 
(Madigan et al. (1995), Kuipers and Moffa (2017)). This is especially important in high 
dimensional domains with sparse data where the single best model cannot be clearly 
identified, so the inference relying on the best scoring model is not justified. 

The first MCMC algorithm over graph structures is due to Madigan et al. (1995), 
later refined by Giudici and Castelo (2003). To improve on the mixing and convergence, 
Friedman and Koller (2001) instead suggested to build a Markov chain on the space of 
node orders, at the price of introducing a bias in the sampling. For smaller systems with 
smaller space and time complexity one of the efficient approaches is the dynamic prog- 
ramming (Koivisto and Sood (2004)), which can be further used to extend the proposals 
of standard structure MCMC approach in a hybrid method (Eaton and Murphy (2007)). 
Within the MCMC approach, to avoid the bias while keeping reasonable convergence 
rate, Grzegorczyk and Husmeier (2008) more recently proposed a new edge reversal move 
method combining ideas both of standard structure and order based MCMC. Recently 
Kuipers and Moffa (2017) presented another MCMC algorithm designed on the combi- 
natorial structure of DAGs, with the advantage of improving convergence with respect 
to structure MCMC, while still providing an unbiased sample since it acts directly on 
the space of DAGs. Moreover, it can also be combined with the algorithm of Grzegorczyk 


and Husmeier (2008) to improve the convergence rate even further. 


60 


4.2 Partition MCMC method 


In this section we describe the Partition MCMC algorithm of Kuipers and Moffa (2017), 
which will be the base of our novel method for learning the structure of BNs. This algo- 
rithm considers combinatorial representation of DAGs to build an efficient MCMC scheme 
directly on the space of DAGs. Its convergence is better than that of the structure MCMC 
and does not introduce bias as the order based MCMC. As we mentioned, the authors 
also proposed a way to combine their method with the new edge reversal move approach 
of Grzegorczyk and Husmeier (2008) and improve upon their MCMC sampler. 

First we need to introduce the notion of layers and partitions for DAG. Given DAG 
G = (V,€) we define layers 4; of the nodes (called interchangeably variables) in the network 


as follows: 
e lo = {v E V : pag(v) = Ø} is the layer of the nodes which do not have any parents; 


e having defined the layer 4; for i = 0,1,...,k — 1 we define the next layer as 


lp = {v E V : Jw E lı such that w € pag(v) and pag(v) C Lk-1}, 


where Lg = U 4. 
i<k-1 
Note that variables from the same layer do not have arrows between them, and that each 
variable (except for the layer @)) has at least one arrow directed towards it from any 
variable from the adjacent previous layer. For instance, the graph in Figure 4.1 has three 
layers: fp = {1,3,5}, 4&4 = {4} and %2 = {2}. 
Suppose that for some arbitrary graph we have q+1 layers. Each layer 4; has a certain 


amount k; of nodes, which in sum gives the total number of nodes d, i.e. > ki = d. 
In addition, with each layer representation there is associated a namia al nodes, 
where we list nodes in the layer order. More precisely, first we write nodes from the first 
layer, then from the second one, etc. For the graph in Figure 4.1 we have the partition 
A = [8,1,1] and the permutation 7), = [1, 3,5,4,2]. Together a pair (A,7)) is called a 
labelled partition. 

Kuipers and Moffa (2017) proposed an efficient MCMC algorithm for exploring the spa- 
ce of partitions to find the most probable layer representation given the observed data. 
Although the full algorithm is suited for structure learning, we want to improve on this 
algorithm and replace the second part of it with the LASSO estimator. The authors 
define an MCMC algorithm on the space of node partitions avoiding in this way over- 
representation of certain DAGs. Compared to other MCMC methods mentioned above 
partition MCMC is faster than structure MCMC of Madigan et al. (1995). It is slower than 
order MCMC of Friedman and Koller (2001) but does not introduce any bias. The basic 
move consists of splitting one element of the partition (i.e. layer) into two parts or joining 
two adjacent elements (the authors also propose an additional move consisting of swapping 


two nodes in adjacent layers). All the partitions reachable from a given partition in 
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Figure 4.1: An example of partition representation of the DAG. 


one basic move are called the neighbourhood of the partition. So the MCMC scheme 
consists of sampling a partition from the neighbourhood of the previous partition with 
a small probability to stay still defined by the user. The obtained partition is scored and 
the score coincides with the posterior probability of the labelled partition. After sampling 
the partition we sample a single DAG weighted according to its posterior. Then we can 
average the acquired DAGs in the MCMC chain and choose the model. However, we 
propose to change the step where we sample DAG from the posterior distribution and 
average DAGs from the MCMC chain. It is well suited for inference and estimation of 
network parameters but we believe that we can improve the Bayesian averaging approach 
in the case of structure learning. We propose to use partition MCMC for finding the best 
scoring partition and next to use it for recovering arrows with the LASSO estimator where 


each parameter corresponds to a certain arrow in the network. 


4.3 The novel approach to structure learning 


We want to combine advantages of partition MCMC and LASSO for linear models. First 
we find the best layer representation using partition MCMC algorithm. Next we obtain 
the final DAG solving d LASSO problems, where d is the number of variables (nodes). 
Having found the most probable layer representation for a DAG we consider two models: 


one for continuous data and one for discrete data. 


4.3.1 Gaussian Bayesian Networks 


For the continuous case we consider Gaussian Bayesian Networks (GBN) introduced in 
Section 3.1. We denote as X/” the m-th random variable in the i-th layer, where m € 
{1,..., ki}. We assume that each e has the normal distribution M(0,07"). We also 
assume that each e€; is independent of all X/". Now given the partition [ko, ki,..., ką] 
we can write the problem of finding the DAG structure as a set of the following d linear 
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model problems: 


1_ al 1 
Xo = boo + & 


xn = Ke 2 + ek 
Xi = Bl, + Bax Pane DAN A el 


= Pht XS + piste ho 4. eb (S1 


= b20 + bzo Xe + -+ ohh po xko gx groar Be +e 


xh = iy + 2 Bret XP? + ee, 
ie 


Then the problem of finding DAG’s structure is equivalent to the problem of finding 


MI, Mi 


non-zero parameters (3; ; This corresponds to starting from the full possible graph 
and removing non-existing edges by shrinking the parameters to 0. It is possible due to 
the fact that we have a partition, where we know which nodes can be parents for which 
nodes. This would not be possible otherwise because the graph has to be acyclic and 
we would have to introduce other constraints to the optimization problem. To solve this 
problem we will apply LASSO regression to each linear model, which tends to shrink 
the coefficients to 0 by penalizing those coefficients with ¢;-norm. 

Let m be the number of observations. Let (Xi) be the matrix of observations, 
where i € {1,...,m} and k € {1,...,d}. Then by XÍ we denote the matrix formed 
by the columns corresponding to the j-th layer. Similarly, by X°%-) we denote the ma- 
trix which consists of the columns of the matrix of observations corresponding to the first 
j layers. By X%{i] we denote the column of the matrix X’ corresponding to the i-th 
variable from the j-th layer and by X; we denote the row of the matrix (Xj) correspond- 
ing to the i-th observation. Moreover, let Bi = eee PA a Haale where 
j= {1,...,q} andi = {1,...,k;}. To find the required vectors Bi we solve the following 


d optimization problems 


Bi = argmin [RSS,;(6) + A;,||@la], (4.2) 


ge tot +hj-1 
where RSS; ;(0) = 1/2||X?[i] — Co lz is a residual sum of squares for the i-th 
variable in the j-th layer and ||@||, = > > |@7""| is the l;-norm of 6. Note, that 0 


0 m= 1 
depends both on j and i. The tuning Steere Aji > 0 balance the minimization of 


the cost function and the penalty function. The form of the penalty is crucial, because 
its singularity at the origin implies that some coordinates of the minimizer Bi are exactly 
equal to 0 if Aj; is sufficiently large. Thus, starting from the graph with all possible 


arrows for the given layer representation (i.e. there are arrows from variables on each 
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layer towards all the variables in next layers) we remove irrelevant edges. The functions 
RSS; ;(@) and the penalty are convex, so (4.2) is a convex minimization problem. This is 
an important fact from both practical and theoretical points of view. 


4.3.2 Theoretical results for GBNs 


By Sj, we denote the support of the true vectors of parameters b, i.e. the sets of non- 
zero coordinates of each 6}, and by S = {Sidi tsy DL hig S21 +--> Ogg, pe Moreover, ic ri 
is the smallest in the absolute value element of 8; restricted to Sj. The set 87; denotes 
the complement of S;;, that is the set of zero coordinates of 6}. For any vector a we 
denote its lo-norm by |la||,. = max, |a,|. For a vector a and a subset of indices I by ay 
we denote the vector a restricted to its coordinates from the set J, i.e. (ar); = a; for 
i € I and (ar); = 0 otherwise. Moreover, |I| denotes the number of elements of J. For 
a vector a = (a1, ...,an) by Cov(a) we denote the matrix (c;j), where c = Var(a;) and 
Cij = Cov(ai, aj). 

Before we state the main results of this chapter we introduce the cone invertibility 
factor (CIF), which plays an important role in the theoretical analysis of the properties of 
LASSO estimators. In literature there are three related notions which are commonly used 
in said analysis and help to provide some constraints on the optimized function so that 
the estimator is “good” in certain sense. These notions are the cone invertibility factor, 


compatibility factor and restricted eigenvalue (see Huang et al. (2013) and references 


therein). For any € > 1 we define the cones C(E, Sji) = fo : [Ase lla < Ells; i}. Then 
CIF is defined as EA 
_ j 
F,,(€)= inf =, 4.3 
ds (£) 0#0EC(E,S;,1) WAl\ oo ( ) 
where Xf is the covariance matrix for a random vector (Xj,...,Xj~-1) of variables from 


the first 7 layers. More precisely, 


yal (XOG-D)* XOG-2), 
m 

Our goal will be to show that the estimators Bi are close to the true vectors b; in a certain 
sense. However, if the curvature of the function in (4.2) around p; is relatively small, 
then the closeness between its values at 8; and 6; does not necessarily imply the closeness 
between the arguments b3 and b; Hence, we require some additional conditions, for 
instance, strong convexity of RSS}; at Gi, i.e. that the smallest eigenvalue of its Hessian 
is positive. In the high-dimensional case it is too strong of an assumption, therefore 
one usually considers restricted strong convexity or restricted smallest eigenvalues, where 
“restricted” means that we take infimum over C(€,5j,;) instead of the whole space. CIF 
(4.3) is an example of such reasoning. We also introduce a non-random version F; ;(€) of 
CIF for each j € {1,...,q} as follows. First we define 


H? = Cov(X,[0: (j — 1), 
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where X,|0 : (j — 1)] denotes the restriction of X, to variables from the first j layers. 
We assume that each H; is positive definite and elements on the diagonal are equal to 1, 
ie. Hi =1, where i € {1,2,..., ki ++ + k;-1}. Then 


lE oll 


Filé) = in ; 4.4 

sal) 040€C(E,S;:) |]O|loo (+a) 
Since in a Gaussian Bayesian Network the joint probability of all variables is assumed to 
be Gaussian, then each marginal is Gaussian as well. Hence, for simplicity we can bound 


the variance for each variable by the same constant 7?. Also we denote 


m, — SPU + 6)" log(| L511" aks/e) 
j F?.(g) 


for each j € {1,...,q} and i € {1,... , kj}. 


(4.5) 


Theorem 4.1. Fix arbitrary € € (0,1) and £ > 1. Assume that Fj,;(&) defined in (4.4) is 
positive for each j € {1,...,q} andi € {1,...,k;}. In addition suppose that 


m > Kı max mji (4.6) 
J; 


and for each i and j we have 


EE elsa 


g= Mji 


for some universal constants Kı and Kə. Then 


i 4g Aji 
P (13 — Blloo < oe a >1-e. 

The second main result is about thresholded version of LASSO estimator. It will 
be proved after the proof of Theorem 4.1. Consider the Thresholded LASSO estima- 
tor with the sets of nonzero coordinates ae The set Sii contains only those coeffi- 
cients of the LASSO estimator (4.2), which are larger in the absolute value than some 
pre-specified threshold 6;; for each j € {1,...,q} and i € {1,..., kj}. We denote 
fii, boas Siki; ae ease Sar] as Ge. 


Corollary 4.2. Suppose that assumptions of Theorem 4.1 are satisfied. If for each j, i 


Aji , then Thresholded LASSO with 


and arbitrary € > 1 we have 8, /2 > ĝji > — 4 
yE bi min/ Ja (E+ 1) Fa 


6 = [b11,---,49,k,] #8 consistent, i.e. 
P(S=8) >l-e. 


Before the proof of Theorem 4.1 we state and prove an auxiliary result Proposition 4.3, 
which is interesting on its own. It describes a slightly more general case and it will be used 


multiple times for different numbers and sets of predictors and targets in order to prove 
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Theorem 4.1. Moreover, to avoid any confusion with indices and notation introduced 
before for convenience we use more general notation in subsequent proofs. 

Hence, let (Y1, Z1),.--, (Ym; Zm) be i.i.d. random vectors such that Y; € R? and Z; € R. 
The coordinates of Y; will be denoted by Y;; for each j = {1,...,p} and by Y we denote 
the full (m x p)—matrix of predictors Y = (Y1,...,Ym)'. Moreover, let H = Cov(Y) is 


a positive definite matrix with diagonal elements H;; = 1 for j = 1,...,p. We assume 
that 

Zea BV +e, i=1,...,m, (4.7) 
where €1,...,€m are iid. random variables with Ee; = 0, which are subgaussian with 
the parameter o°, and are independent of p predictors Y;,..., Yp. Subgaussianity means 


that for each 7 anda € R 


Eexp(ae;) < exp(a?o?/2). 


We also assume that predictors are subgaussian with the parameter 7°, i.e. Eexp(aYi;) < 
exp(a?77/2) for each j = 1,...,p. 
The goal is to find the set of indices of the relevant predictors 


S= {j € {1,..., p}: B; #0}. (4.8) 


The set S° denotes the complement of S, that is the set of zero coordinates of 8. Now 
consider the LASSO estimator 


A 


6 = argmin|RSS(0) + All6ll,], (4.9) 
GER? 


where 


RSS(0) = — y (Z= 0Y). 


i=1 
For any € > 1 we define the cone C (E, S) = {0 : ||0se 


1 < €|/Os||1}. Then CIF is defined as 


- ; |¥TYO/mlloc 
Fe) =. if —— 
© 040eC(E,S) Olla 
and its non-random version is given by 
[Hll 


i) = 


= 1n 2 
o40€C(é,S) Illl 


Proposition 4.3. Fix arbitrary a € (0,1) and € > 1. Suppose that F(E) is positive and 
| KilSPr#(1 + £)? log(p?/a) 


4.10 
and 
Eel log(p/a) 
> r 

A2 Kae 79 aa (4.11) 

where Kı, Kə are some universal constants. Then 

a 4EX 

P B— Bln < epee) > 1 2a 4.12 
(12-81 < SHE m 
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The proof of Proposition 4.3 relies on Lemma 4.4 and 4.6 below. 


Lemma 4.4. In the context of previously defined random variables Yi; and ci, where 
i= {1,...,m}, for arbitrary j =1,...,p and u > 0 we have 


1€ 2u u 
P| -5> Yei > 2ra | 2\/— 42) | < exp(-u). 
G 2 jE TO ( z + 3) exp(—u) 


The proof of Lemma 4.4 uses the following Corollary 8.2 of van de Geer (2016). 


Lemma 4.5. Suppose that Zı,..., Zn are i.i.d. random variables and there exists L > 0 
such that C? = E exp (|Z,|/L) is finite. Then for arbitrary u > 0 


P (2 WZ, —EZ;) > 2L (cy 9) < exp(—u). 


i=1 


Proof of Lemma 4.4. Fix j = 1,...,p and u > 0. We consider an average of i.i.d. centered 


random variables Z; = Y;je; with EZ; = 0, so we can use Lemma 4.5. We need to find 
L,C > 0 such that E exp (|Y1;61|/L) < C?. Note that 


E exp (|¥ijé1|/L) < E exp (Yi;e1/L) + E exp (—Yi,61/L) . (4.13) 


For the first term on the right-hand side of (4.13) we have 


E exp (Y1j61/L) = E [E (exp(Yije1/L) | Yaz)]. 


Using independence of Y;; and £1, and subgaussianity of €, for each y € R we obtain 


E [exp (Yiz£1/L) | Yiy = y] = E exp (yer/L) < exp (y’o*/(2L")) . 


Therefore we have 


E [E (exp(Y1jé1/L)|Yi;)| < E exp (Yjo°/(2L")) , 


which, using subgaussianity of Y;; and Lemma 7.4 of Baraniuk et al. (2011), we can bound 


from above by 
1 


provided that L > to. The second expectation on the right-hand side of (4.13) can be 


bounded analogously, hence, we obtain 
2 
VO — 7202/L?)’ 


provided that L > ro. We can take L = 27o and obtain C > ae which finishes the proof. 


E exp (\Yijei|/L) < 


Lemma 4.6. Suppose that assumptions of Proposition 4.3 are satisfied. Then for arbitrary 
e € (0,1) and € > 1 with probability at least 1 — € we have F(€) > F(€)/2. 
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Proof. Fix € € (0,1) and € > 1. We start with considering the /,.-norm of the matrix 


1 
liry — EY, Y; 
m 


1 
DD a o 
m 


oO 


Fix j,k € {1,...,p}. Notice that for any two numbers a and b we have the inequality 


2 2 è 
ab < Ṣ + 5. Hence, we can write 


Y2 Y2 
vis Yik| < T + J 


Therefore, first using the previous inequality and Cauchy-Schwarz inequality afterwards 


for any positive constant L we obtain 


zexp ([Y;4Yin|/L) < Eexp (¥2/(2L)) exp (Y /2L)) 
(4.14) 
< yE exp (Y?,/L) Eep (Yj,/L). 


The variable Yi; is subgaussian, so using Lemma 7.4 of Baraniuk et al. (2011) we can 

—1/2 
bound the first expectation under the square root in (4.14) from above by (1 = 27? : 
provided that 27? < L. The second expectation under the square root in (4.14) can be 


bounded by the same value when we use the subgaussianity of Yip. Therefore, 


Q72\ T 
op (Yas (1-7) 
provided that 27? < L. Applying Lemma 4.5 with L = 37? and C = 2 and u = log(p?/e) 


we obtain 
] 2 
e( > Kr? og(p A) < = 


m p2’ 
where K is an universal constant. Therefore, 


> Kr] =n] - 


m 


1 m 
Y Yara EN 
Hy i=1 


oO 


1 
P (r -— EY,' Y; 
m 


1 
—S°Yij¥in — EYijYik 


> Kr? ear) £ (4.15) 


Proceeding similarly to the proof of Lemma 4.1 of Huang et al. (2013) we first obtain 


[ir nss ace <fi- m- 
m m ee m 5o 


OO 


1 
= lary E n| (Asli + llðsell) < + 8S1- Illo ry —EY,'Y, 


OO 
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This implies that 


1 1 
|e > || — (1+ 151 -[Pll | -¥TY - ENT 
m m 


co co 


Then by dividing both sides by ||@||,., taking infimum with respect to 8 over the cone C(&, S) 
and using (4.15) we derive that 


F(E) > F()- K(1 + Slr? sate 


with probability higher than 1 — e. Finally, using (4.10) we have 


Hazro- FO = (1 P a) F(8). 


We finish the proof by taking sufficiently large K4. 


Proof of Proposition 4.3. The central part of the proof is to show that 


a DEA 
P (1 -= Blloo S Sa) ae (4.16) 


Let us denote 2 = {||VRSS(B)|loo < A} Now we want to bound from below the pro- 
bability of Q. For each j = 1,...,p we can calculate j-th partial derivative of RSS(0) at 


true 8 


V;RSS(8) = oe eg 3 Ta 


and we bound it from above with high tite using Lemma 4.4. Therefore, taking 
(4.11) into account we have 


P(Q) =P (mex IVR) < A “(0 fiv, RSS(8 sn }) 2 
=1-P (Ó {IMsRS5(9) > Ea) 21- Sp (Iv,ns5(a) > i) > 


j=l j=l 


Eie Sr (vass) sir e/a) l 


Now applying Lemma 4.4 with u = log(p/2a) and appropriately chosen Ky we bound 
from below this probability by 1 — a. 
In further argumentation we consider only the event Q. Besides, we denote B= B -B 


where 6 is a minimizer of a convex function given in (4.9), which is equivalent to 


EP Shei, 32, 
bees ad (4.17) 
Pals. fÂ; =0, 
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where j = 1,...,p. Next we show that @ € C(€,S). Our argumentation is analogous to 
Ye and Zhang (2010). From conditions in (4.17) and the fact that ||5lj = ||Gs|la + |[Gsell1 


we obtain 


0<BTYTYB/m= a |VRss(3) ~ VRSS(8) 
= 5° 6;V;RS5(8) + X` BV; RSS(B) — B'VRSS(B) < 


jes JESe 
SAS TB I-A S— [BL + Ilall VESS) = 
jes jese 


= [A + ||VRSS(b)ll]llősllı + [IV RSS(B) loo — MllBsella 
Since we exclusively consider the event Q, we obtain the following inequality 


< À + IVRSS(B)ll 
1 $ XL I[VRSS A)|] 


lős lőslh < £llőslh. 


Hence, we have just proved that 3 belongs to the cone C (£, S). Therefore from the defi- 
nition of F (£) we have 


13 oja < LOX G=Airmllse < ITRSSÓ)l + ITRS 
( F) 

Using the second condition in (4.17) and the definition of the event Q we then obtain 

(4.16). Finally, having shown (4.16), we apply Lemma 4.6 and obtain (4.12) which finishes 

the proof. 


Proof of Theorem 4.1. In order to show that our estimator is close to the true parameter 


vector 6 we first use union bounds. So here we have 


3 4g Aji Ai i AEA 5 _ 
P (18- Ble < Pe max BM) > P (n {18 Bille < zine} | - 


Jit 
a AEN; ; 
(Uw Bi ts} | > 
21- P { 165 =A > et), 
> (1 EFIA 
Then, using Proposition 4.3 separately for each variable Xi in each layer £; for j = 1....,q 
with A = \;;, with the number of predictors equal to p = |L;—ı| and a = — we obtain 


j 
that the expression above can be bounded from below by 1 — e€. The bound on the number 


of observations m is chosen according to (4.5) and (4.6). 


To prove Corollary 4.2 we apply the same methodology. Namely, we prove an auxi- 
liary lemma concerning the model described by (4.7), so the set S is defined by (4.8). 
Additionally, by min we denote the smallest in the absolute value non-zero coordinate 
of the true parameter vector 6. By ĝ we denote the set of non-zero coordinates of 
the Thresholded LASSO estimator with the level ô, i.e. the coordinates of the vector Ê, 
which are greater than 0. 
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Lemma 4.7. Fiza € (0,1) and € > 1. Then under the assumptions of Proposition 4.3 


and 


we have 


Proof. Take any j € S. Then from Proposition 4.3 and (4.18) with the probability greater 
than 1 — a we have 


Bs| = |8; — Bl < IIB —Blleo < 6. 
Therefore, the j-th coordinate of Thresholded LASSO ee = 0. Next, we take j € S and 
obtain, also from Proposition 4.3 and (4.18), that with probability greater than 1 — a 


e SH ST Sta S Sd 


Hence, a #0. 


Proof of Corollary 4.2. From Lemma 4.7 for each j € {1,...,q} and i = {1,... , kj} under 
the assumptions of Theorem 4.1 we have that for arbitrary aj; € (0,1) 


P (Sja F Sia) < Gji. 


Now we obtain 


P @ a s) =P (Us. # sw) Š 5 Sr (Sja P Sia) i 


ji 


By taking a;; = < we obtain the bound P(S5 # S) < £ and finish the proof. 
qh; 


4.4 Discrete case 


As we discussed in Section 3.1 in the discrete case as the distribution of the model we 
take a collection of categorical distributions for each variable. First we assume a binary 


case so that each X; € {0,1} and we consider the logistic regression model. Let us denote 


the sigmoid function as o(x) = i . In this setting we can write probabilities for 
et 


each variable in each layer similar to (4.1) as follows 


PG =1)= (Big + By 9X4 + : -+ BE XE) 
P(X" = 1) = o(fih + oo + BIg XG) ae 
P(X} = 1) = o (B39 + 20X0 +--+ +A PXE + By XE He + BT XT) l 


Paren +> Bx). 
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Using the same notation as for the continuous case we need to solve the following d opti- 


mization problems 


b= argmin [6,,(0)+AjllOlli], F=1,.-..¢ t=1,...,h;, 
geR or +kj—1 
where lj; is the negative log-likelihood for the i-th variable in the j-th layer and has 
the following form 


tlO) = =X Xeno [ox] +0 = Xosa os [i =a (07x00=)]. 
l=1 


Here we denote by p = p(j) = ko +--+: + kj-ı the number of variables in the pre- 
vious j — 1 layers. 

We can also generalize the above case to the case where each variable has a discrete and 
finite state space, namely each Xi € {1,... , Nj}. Now instead of the sigmoid function we 


use the so-called softmaz function. For any vector a = (a1,...,@n) we define the softmax 
exp(a;) 
Daj=1 exp(a;) 


We denote as X; = (QE XG sage)" for j = 1,...,q. Also we denote 


function o(a) as the vector o(a) = (a(a)[1],...,0(a)[n]), where o(a)|i] = 


the vectors of parameters corresponding to the l-th class of the i-th variable in the j-th 
layer as 
i i1 i,ko i1 ikj—1 
PAU — (67oll], ees Bro UF 671l], srt Mj j1 {]) 
for j = 0,...,q — 1, i = 1,...,kj and l = Desa Ne: Then the model analogous to 
the logistic model in (4.19) takes the form 


P(Xi = 1) = o(bi oll] + Ai[1]X1, - - - , 81 olNi] + GNX.) 
P(X} = NJ) =o( Soll] + 820%... LINE] + AHNEIXOINA 
PXE = 1) = (8t) + BUX, SINE) + PNE IXO 
P(X = 1) =o (65 [1] + Bi[UX;,..-, Be olNs] + BING XA)Y 


P(X}? = NE) = o (Biol + 8r Xo -o BedLNG"] + LNG |X) NG"). 


This is called multinomial logistic regression. It is not difficult to notice that logistic 
regression is a particular case of multinomial logistic regression with two possible classes. 
For each variable X f we denote the full vector of parameters p; = (6i [es b; N. I). Then 
we need to solve d optimization problems analogous to the case of logistic regression 


Bi = argmin WE; a(O) T. Aj alll], Fe leeds, tala < kj, 


ger 0H +j] 
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where ¢;; is also the negative log-likelihood for the i-th variable in the j-th layer and in 
this case has the following form 


lalh) = -XY UX pro = k) | [XY — log | X Ax» ] |, 
l=1 k=1 iA 


where we again denoted by p = p(j) = ko +--+ kj—ı the number of variables in the pre- 
vious j — 1 layers. 


4.5 Numerical results 


In this section we describe the details of algorithm implementation as well as the results 


of experimental studies comparing our algorithm to others. 


4.5.1 Details of implementation 


We provide in details practical implementation of the proposed algorithm. The solu- 
tion of (4.2) depends on the choice of \;;. Finding the ,optimal” parameters A; and 
the thresholds 6; in practice is difficult. We solve it using the information criteria (Xue 
et al., 2012; Pokarowski and Mielniczuk, 2015; Miasojedow and Rejchel, 2018). 
First, recall the function which is being minimized in (4.2) 
1 2 ae 
RSS;4(9) + jalll = 5 |X - OT XD +S) SS Om, 
1=0 m=l 
with X/[i] being the vector of the length m of observations for the i-th variable in the j-th 
layer. We perform the optimization separately for each variable and the vector 0 is from 


Reot-+ki-1 for j = 1,...,q andi =1,...,k,. In our implementation we use the following 


scheme. We start with computing a sequence of minimizers on the grid, i.e. for any j 
and į we create a finite sequence {Ap H] uniformly spaced on the log scale, starting from 
the largest A, which corresponds to the empty model. Next, for each value A; we compute 
the estimator Bi [k] of the vector 6; 

6 [k] = argmin {RSS;:(0) + AzllAli}. (4.20) 

GeR*ot--kj-1 
To solve (4.20) numerically for a given A we use the FISTA algorithm with backtracking 
from Beck and Teboulle (2009). The final LASSO estimator Bs = Bi [k*] is chosen using 
the Bayesian Information Criterion (BIC), which is a popular method of choosing Aj, in 
the literature (Xue et al., 2012; Miasojedow and Rejchel, 2018), i.e. 
k* = argmin {m log(R55(8%[k])) + log(m)Ilĝ;k]llo } 
1<k<N 
Here lô: [lk]||o denotes the number of non-zero elements of Bi [k] and m is the number of 


observations of the network. In our simulations we use N = 100. 
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Finally, the threshold ô is obtained using the Generalized Information Criterion (GIC). 
A similar way of choosing a threshold was used previously in Pokarowski and Mielniczuk 
(2015); Miasojedow and Rejchel (2018). For a prespecified sequence of thresholds 2 we 


calculate 


Öja = argmin { mlog(RSS(B}.s)) + log(ko +--+ kya) IBjsllo} 
Eg 


where oe is the LASSO estimator p after thresholding with the level ô. 


4.5.2 Experiments 


In this subsection we compare our algorithm to other algorithms developed for this prob- 
lem applying them to benchmark networks. We use the bnlearn package in R (Scutari 
(2010)), in which many algorithms for learning Bayesian networks including structure 
learning are implemented. Algorithms of different types discussed in the beginning of 
this chapter such as constraint-based algorithms, score-and-search algorithms and hybrid 
algorithms can be found there. The choice of specific algorithms was made empirically, 
i.e. we selected the best performing ones on the chosen networks. We took the networks 
with continuous data of a medium, large and very large amount of nodes and arcs. We 
refer to medium, large and very large sizes as 20-50 nodes, 50-100 nodes or 100-1000 
nodes, respectively, adopting this classification from the authors of bnlearn package. 

We chose a medium-size network ECOLI70 with 46 nodes and 70 arcs (Schäfer and 
Strimmer (2005)), a large network MAGIC-IRRI * with 64 nodes and 102 arcs and 
a very large network ART H150 with 107 nodes and 150 arcs (Opgen-Rhein and Strim- 
mer (2007)). The algorithms chosen for comparison are hill-climbing (hc) algorithm, 
tabu search (tabu), max-min hill-climbing (mmhc) and Hybrid HPC (h2pc) algorithm. 
Hill-climbing (Scutari et al. (2018)) is a greedy search algorithm that explores the space 
of the directed acyclic graphs (DAGs) by an addition, removal or reversal of a single 
edge and uses random restarts to avoid local optima. Tabu search (Russell and Norvig 
(2010)) is a modified hill-climbing method, which is able to escape local optima by se- 
lecting a network that minimally decreases the score function. Both methods above use 
search-and-score approach. Max-min hill-climbing algorithm (Tsamardinos et al. (2006)) 
is a hybrid method combining a constraint-based algorithm called maz-min parents and 
children and hill-climbing. H2PC (Hybrid HPC, Gasse et al. (2014)) algorithm is a hybrid 
algorithm combining an ensemble of weak PC learners (Spirtes and Glymour (1991)) and 
hill-climbing. For more different comparisons of methods in bnlearn package see Scutari 
et al. (2018). 


*The model M AGIC-IRRI was developed as an example of multiple trait modelling in 
plant genetics for the invited talk “Bayesian Networks, MAGIC Populations and Multiple 
Trait Prediction” delivered by Marco Scutari, the author of bnlearn package, at the 5th 
International Conference on Quantitative Genetics (ICQG 2016). 
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For each network we used two sizes of the data set with m = 300 and m = 1000 
observations. In the tables with results we denoted them with 2-3 first letters of the name 
of the network followed by the number of observations so it does not create any confusion. 
For each algorithm we ran the experiment for 100 times, each time with a new set of m ob- 


servations, and averaged the results in terms of three performance measures: 
e power, i.e. the proportion of correctly discovered edges. 


e false discovery rate (FDR), i.e. the fraction of incorrectly selected edges among 
all selected edges. 


e structural Hamming distance (SHD), i.e. the smallest number of operations 
(such as adding or removing the edge and changing the direction of the arrow) 
required to match the true DAG and a learned one. 


In Tables 4.1-4.3 we provide the results of experiments for mentioned above data sets and 
methods including ours. In terms of power our algorithm performs right in the middle 
of score-and-search and hybrid methods for ECOLI70 data set, similarly to the hybrid 
methods in case of small MAGIC-IRRI data sets. We note that for data sets of 1000 
observations it performs worse than other methods, however, with the number of obser- 
vations growing the algorithm’s power grows as well. With the number of observations of 
10000 it grows up to 0.5 performing as good as other score methods without any increase 
in FDR. The same situation we observe with ART'H150 data set. 

In terms of FDR our algorithm performs the best consistently giving very low numbers 
for false discoveries. This is especially important when the cost of a false discovery is high 
and makes obtained discoveries more certain. With the numbers of observations of 10000 
we constantly get numbers in the range 0.2-0.4% for all data sets. When it comes to 
structural Hamming distance (SHD) our algorithm performs the best or close to the best 
numbers as well. With the growing number of observations it decreases due to increasing 
power and consistently low FDR. For the number of observations of 10000 it outperforms 
other algorithms or has close SHDs to hybrid methods and reaches around 28, 62 and 86 
for ECOLI70, MAGIC-IRRI and ART H150 data sets, respectively. 

We also checked our method on a discrete binary network called ASTA, introduced 
in Chapter 3. It is a small network of 8 nodes and 8 edges. Our algorithm recognizes 6 
arrows and makes 2 false discoveries, discovering 8 arrows in total. However, after a closer 
look we noticed that it could not recognize 2 arrows due to incorrect assignment of layers. 
Finally, one false discovery was an arrow of an opposite direction to the true one, and 
the other one was an arrow from the start to the end of a causal trail (we obtained 
an additional arrow X — W for the trail X => Y — Z —> W), hence still recovering 


dependencies in both cases. 
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Method EC300 EC1000 MAG300 MAG1000 AR300 AR1000 
he 0.57 0.65 0.28 0.45 0.58 0.67 
tabu 0.6 0.7 0.31 0.5 0.59 0.68 
mmhc 0.39 0.45 0.25 0.46 0.48 0.58 
h2pc 0.4 0.49 0.23 0.45 0.5 0.61 
MCMC + LASSO 0.49 0.55 0.24 0.38 0.38 0.46 


Table 4.1: Average power for ECOLITO, M AGIC-IRRI and ARTH150 


300 and 1000 observations. 


networks for 


Method EC300 EC1000 MAG300 MAG1000 AR300 AR1000 
hc 0.049 0.044 0.04 0.037 0.028 0.19 
tabu 0.047 0.036 0.04 0.036 0.028 0.018 
mmhc 0.021 0.024 0.022 0.022 0.01 0.008 
h2pc 0.02 0.023 0.14 0.019 0.006 0.006 
MCMC + LASSO 0.004 0.004 0.004 0.006 0.004 0.003 


Table 4.2: Average FDR for ECOLI70, MAGIC-IRRI and ART H150 networks for 300 


and 1000 observations. 


Method EC300 EC1000 MAG300 MAG1000 AR300 AR1000 
he 65.6 51.8 129.5 96.5 214.2 151.1 
tabu 64.8 46.9 127.1 93.1 215 150.9 
mmhc 48.1 39.2 101.7 72.2 127.3 104.1 
h2pc 45.9 37.9 91.6 67.21 103.7 90.3 
MCMC + LASSO 39.2 35.1 85.9 79.3 103 98.7 


Table 4.3: Average SHD for ECOLI70, M AGIC-IRRI and ART H150 networks for 300 


and 1000 observations. 
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Chapter 5 


Structure learning for CTBNs for 
complete data 


In this chapter we consider continuous time Bayesian networks (CTBNs) introduced and 
defined in Section 2.5. First we consider the fully observed case where we observe the be- 


haviour of the network at each moment of time. 


5.1 Notation and preliminaries 


In this section we describe the proposed method, using the notation introduced in Sec- 
tion 2.5. First, we consider the full graph G = (V,€), namely we assume that pag(w) = 
pa(w) = —w for each w € V. Then we remove unnecessary edges using the penalized 
likelihood technique. We start by introducing the new parametrization of the model. For 
simplicity, in the main part of this chapter we consider the binary graph, i.e. Xu = {0,1} 
for each w € V. The extension of our results to more general case is described in Sec- 
tion 5.5. 

Let d be the number of nodes in the graph. Consider a fixed order (w1, w2,..., Wa) of 
nodes of the graph. Using this order we define a (2d) x d-dimensional matrix 


+ 
= wi wi QW2 QW2 Wd QWa 
B ~~ ( Dlr LO “0,1? 108 ** 2 ~0,1? ia) ’ (5.1) 


whose rows are vectors BY, € R¢ for all w € V and s,s’ € {0,1} such that s ¥ s. 
Obviously, the matrix £ can be easily transformed to 2d?-dimensional vector in a standard 
way. In this chapter we assume that for all w € V, c E€ Xw, 5,8’ € {0,1}, s Æ 8’ 


the conditional intensity matrices satisfy 


log(Qu(c, 5, 8')) = Bw! Zale), (5.2) 


where Zw: X -w — {0,1}! is a binary deterministic function described below. With 
the slight abuse of notation, by Z we will denote the set of all functions Zy,,..., Zw,- 
In (5.2) the conditional intensity matrix Q,,(-,s,s’) is modeled in the analogous way 
to the regression function in generalized linear models (GLM) and the functions Z,,(-) 


TT 


play the role of explanatory variables (covariates). In our setting the link function is 
logarithmic. The analogous approach can be found in Andersen and Gill (1982); Huang 
et al. (2013), where the Cox model is considered. The relation between the intensity and 
covariates in those papers is similar to (5.2). Since the considered CTBNs do not contain 
explanatory variables, we introduce them artificially as any possible representations of 
parents’ states. Thus, for every w € VY these explanatory variables are dummy variables 
encoding all possible configurations in pa(w) = —w. To make it more transparent we 


consider the following example. 


Example 5.1. We consider CTBN with three nodes A, B and C. For the node A we define 
the function Z4 as 


Za(b, c) = (1, (b = 1), (c = 1)" 


for each b,c € {0,1}, where I(-) is the indicator function. Therefore, for each configu- 


ration of parents’ states (i.e. values in the nodes B and C) the value of the function 
Za(-,:) is a three-dimensional binary vector, whose coordinates correspond to the inter- 


cept, the value in the parent B and the value in the parent C, respectively. Analogously, 


we define representations for remaining nodes: Zp(a,c) = [1,I(a = 1), I(e = 1)|' and 
Zo(a, b) = [1, I(a = 1), 1(b = 1)]' for each a,b,c € {0,1}. In this example the parameter 
vector (5.1) is defined as 8 = ae Pio (cae Bre BSa» Boy With slight abuse of notation, 
the vector can is given as Boi = oe BA, (B), BAC)" and we interpret (5.2) as fol- 
lows: Foie) = 0 means that the intensity of the change from the state 0 to 1 at the node A 
does not depend on the state at the node B. Similarly, Ber (C) describes the dependence 


between the above intensity and the state at the node C, and bé (1) corresponds to 
the intercept. For the node B the coordinates of the vector 62, = [b8 (1), 63,(A), 664,(C)] 
describe the relation between the intensity of the jump from the state 0 to 1 at the node B 
to the intercept, states at nodes A and C, respectively. 

Now what if Z = {Z4, Zg, Zc} was defined differently? The new function Z4 can be 
defined in 3 more different ways, for example Z 4(b, c) = [1,1(b = 0), I(c = 1)|'. The same 


applies to the functions Zg and Zc. Having defined Z = {Z 4, Zc, Zc} we obtain the new 
vector of the parameters 8. Then for instance we have a = [Boa (1) Bo (B), Bo. (O)| 
and so on. Note that both sets Z and Z fully describe the state configuration of the net- 
work and both Béh and Ba correspond to the same dependencies as above. In par- 
ticular, it is easy to check that for instance 6¢h(B) = 67(B) = 0 if and only if 
Boa(B) = B1o(B) = 0. 

Analogously as in Example 5.1, for w € V, u # w, and s,s’ € {0,1}, s 4 s’ we 
define the coordinate of the function Z,, corresponding to the node u as an indicator of 
its state equal to either 0 or 1. Moreover, we denote the coordinate of 6%,, corresponding 
to the node u by BY,,(u). We interpret 8%}, (u) as the parameter describing dependence of 
the intensity of the jump from the state s to s’ at the node w on the state at the node u. 

Our goal is to find edges in a directed graph (V,€). We define the relation between 
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parameters and edges in (V,€) in the following way 
boilu) £0 or BYo(u) #0 & the edge u > w exists, (5.3) 


which makes parameters compatible with the considered CTBNs. Roughly speaking, 
the node u is a parent of w means that the intensity of switching a state at w depends 
on the state at u. Therefore, the problem of finding edges in the graph is reformulated as 
the problem of the estimation of the parameter (. 


Remark 5.2. As we mentioned previously in the example, the set Z fully describes the pa- 
rents state configuration and the relation above does not depend on the choice of Z. More 
precisely, assume we have two different properly defined Z and Z and the corresponding 
vectors of parameters 


B Z wi wi w2 w2 Wd Wd 
= OL 91,0 90,1 91,0 2'+*9h0,1 21,0 ’ 
WAwWwad Wada T 


WY Wy W2 W2 
B= (Boi , B10 , Êo , B10 jeer bon , B10 ) 


Then the following is true 


BY, (u) = 0 A bP lu) = 0 & Boilu) = 0 A Bi olu) = 0. 


This means that no matter how we define our explanatory functions Z,, we will get 


the same arrows in the underlying CTBN. 


Remark 5.3. For simplicity, in the rest of the thesis, we omit the first coordinate 8¥,(1) in 
the vector BY, for all w, s # s’, because it corresponds to the intercept and is not involved 
in the recognition of the edges in the graph. The first coordinates of representations Z,,(c) 


are discarded as well. 


Remark 5.4. The Markov equivalence /identifiability /non-uniqueness problem is challen- 
ging for directed graphical models. However, this problem does not appear here for 
CTBNs. It is a consequence of our Assumption (5.2), which states that we restrict to 
models having a conditional intensity in the GLM form. Moreover, under this assump- 
tion 6 is uniquely determined. Moreover, this uniquely defined 6 determines uniquely 
the structure of a graph by (5.3). In fact, our main result (Theorem 5.5 below) shows 
consistency of the estimator of B, which is a much stronger property than identifiability. 
Finally, in Assumption (5.2) we require that a conditional intensity of a variable is a linear 
function of the states of its parents. This condition can be easily extended to a polynomial 


dependence, so it can cover quite general dependence structure. 


Our method is based on estimating the parameter 6 using the penalized likelihood 
method. In the rest of the thesis the term £ is reserved for the true value of the parameter. 
Other quantities are denoted by 0. First, we consider a function 


Lo) = IS pp ` [-nw(e; s, 8024" Zwle) + tulc; s) exp (Oy "Zul(c))], (5-4) 


WEY cEX_y SES 
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where the third sum in (5.4) is over all s,s’ € Xy such that s # s’. Recall that nu(c; s, s’) 
and t,,(c; s) were introduced in Section 2.5 to denote the number of jumps from a state s 
to s and the total time in the state s for the node w, respectively, while the parents 
configuration equals to c. Notice that the function (5.4) is the negative log-likelihood. 
Indeed, we just apply the negative logarithm to the density (2.8) combined with (2.9) and 
(5.2), where pa(w) = —w for each w € V. Then we divide it by T and omit the term 
corresponding to the initial distribution v, because v does not depend on 8. We define 
an estimator of 3 as 


B= argmin {€(8) + AllA||i}, (5.5) 


OER24(d-1) 
where |||) = >) >) X |02,(u)| is the l;-norm of 0. The tuning parameter A > 0 


wEV ss! ue—w 
characterizes a balance between minimizing the negative log-likelihood and the penalty 


function. As we have mentioned, the form of the penalty is crucial, because its singularity 
at the origin implies that some coordinates of the minimizer B are exactly equal to 0, 
if A is sufficiently large. Thus, starting from the full graph we remove irrelevant edges and 
estimate parameters for existing ones simultaneously. The function /(@) and the penalty 
are convex functions, so (5.5) is a convex minimization problem, which is an important 
fact from both practical and theoretical point of views. 

At first glance, computing (5.5) seems to be computationally complex, because the num- 
ber of summands in (5.4) is d2¢. However, the number of nonzero terms of the form 
Nw(c; s,s’) and t,(c;s) is bounded by the total number of jumps, which grows linearly 
with time T. Hence, most of summands in (5.4) are also zeroes and the minimizer (5.5) 
can be calculated efficiently. 

Before we state and prove main results of this chapter we introduce some additional 


notation. First, for each w € V we denote its parents indicated by the true parameter 8 as 
Sw = {u E —w : pylu) #0 or přou) £0}. 


By S we denote the support of 8, i.e. the set of nonzero coordinates of 8. Moreover, min is 
the smallest in the absolute value element of 8 restricted to S. The set S° denotes 
the complement of S, that is the set of zero coordinates of 8. Besides, for each w € V we 
define —S, =V\{S,Uw} and denote A= max Q(s,s’). 


s,s EX, ss! 
Recall that for a vector a we denote its l-norm by ||a||,, = max, |a,|. For a subset T 


the vector ar denotes a vector such that (ar); = a; for i € I and (ay); = 0 otherwise. 
Moreover, |/| denotes the number of elements of J. 

Let m be the stationary distribution of the Markov jump process (MJP), which is 
defined by an intensity matrix Q. The initial distribution of this process is denoted by v 
and we define ||v ||} = X v?(s)/m(s). Moreover, pı denotes the smallest positive eigenvalue 

a 


SE 
of the matrix —1/2(Q + Q*), where Q* is an adjoint matrix of Q., 
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5.2 Main results 


In this subsection, we state two key results on the structure learning for CTBNs for 
complete data. In the first one (Theorem 5.5) we show that the estimation error of the mi- 
nimizer 3 given by (5.5) can be controlled with probability close to 1. In the second main 
result (Corollary 5.6) we state that the thresholded version of (5.5) is able to recognize 
the structure of the graph with high probability. 

First, we introduce the cone invertibility factor (CIF), which plays an important role 
in the theoretical analysis of the properties of LASSO estimators. Our goal is to show that 
the estimator Ê is close to the true vector 8. To accomplish this goal we show in Lemma 5.9 
that the gradient of the likelihood (5.4) evaluated at 8 is close to 0. However, this is not 
sufficient since the likelihood function cannot be too “flat”. Namely, its curvature around 
the local optimum needs to be relatively high, because we want to avoid the situation 
when the loss difference can be small whereas the error is large. In the high-dimensional 
scenario this is often provided by imposing the restricted strong convexity condition (RSC) 

n (5.4), as in Negahban et al. (2009). CIF defined below in (5.6) plays a similar role 
to RSC, but gives sharper consistency results (Ye and Zhang, 2010). Therefore, it is 
used here. CIF is defined analogously to Ye and Zhang (2010); Huang and Zhang (2012); 
Huang et al. (2013) and is closely related to the compatibility factor (van de Geer, 2008) 
or the restricted eigenvalue condition (Bickel et al., 2009). Recall that in the previous 
chapter we also used a version of CIF for Bayesian networks. Thus, for any € > 1 we 
define the cone C(€, S) = {0 : ||@sel]1 < €||@s||1}, where the set S denotes the support of 8 
as mentioned above. Then CIF is defined as 


_ 01V” 4(6)0 
F = in wa it Wan 
(8) 040€C(E,S) ||Og||1||P|loo 


(5.6) 


Notice that only the value of the Hessian V?é(6) at the true parameter £ is taken into 
consideration in (5.6). The main difficulty with CIF in our case is that it is a minimum of 
the sum of random terms, which number grows exponentially in d. To be able to control 
this quantity, we bound CIF from below by its deterministic counterpart with much fewer 
summands. Namely, in Lemma 5.11 we prove that F(€) is bounded from below by the 
product of Ço given in Theorem 5.5 and 


exp (67 Zu(es,,,0)) [6°77 Zw(es,,,0)]” 
Pj= Sg y = (5.7) 


wey ES cs, CXS, llðsljı Allo 


with probability close to 1. Here we divided each parent configuration c = (cg,,,c_s,,) 
into two parts: the first one corresponds to the true parents nodes and the second part 
corresponds to the remaining nodes. Below we will also use a similar notation for any state 
of the network s € ¥ defining it as a triple s = (cg,,, c_s,,; S) of the state of true parents 
of the node w, the configuration for the nodes from S_, and the state in the node w. 
Note, that we restricted the summation in (5.7) only to cs, E ¥s,, by taking c_s, = 0. 
This allows us to derive the lower bound on F (£) without considering exponentially many 
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random summands. Our argumentation will be also valid in the case, when we choose 
some nonzero values in c_g,,, unless this values depend on w and cz,,. Next, we state two 


main results of this chapter. 


Theorem 5.5. Fix arbitrary £ € (0,1) and € > 1. Suppose that 


36 (may |Sw| + 1) log 2 + log (d||v||2/€) 


T > : (5.8) 
2 ; 
Cee 
CSy EX Sy 
We also assume that TA > 2 and we choose X such that 
&+1 A 2&F (E) 
2> log(K/e)4f/ = <A < —> 5, 5.9 
an eN yS < EDIS] (5.9) 
where K = 2(2 + e”)d(d—1) and 
Co= min, (Esu: 0; 8)/2. 
CS EXSw 
Then with probability at least 1 — 2e we have 
a 2eEX 
pm Ple Sa a 5.10 
Pll = EGF os 


Now consider the Thresholded LASSO estimator with the set of nonzero coordinates S. 
The set Ô contains only those coefficients of the LASSO estimator (5.5), which are larger 
in the absolute value than a pre-specified threshold ô. 


Corollary 5.6. Suppose that assumptions of Theorem 5.5 are satisfied and let R denote 
the right-hand side of the inequality (5.10). If R < Bmin/2, then for 6 € |R, Bnin/2) we 
have P (= 8) > 1—2e. 


These two results will be proven in the next section and here we give some comments 
on their meaning and significance. The above two results describe the properties of 
the proposed estimator (5.5) in recognizing the structure of the graph. Theorem 5.5 gives 
conditions under which the estimation error of (5.5) can be controlled. Namely, let us 
for a moment ignore constants, A and parameters of MJP such as v,7, p1,¢0, etc. in 


the assumptions. By condition (5.9), if 


, os*(d/e)S? 
2 PO 


then the estimation error is small. This forms some restrictions on the number of vertices 


T (5.11) 


in the graph, sparsity of the graph (i.e. the number of edges has to be small enough) and 
the expression (5.7), which is discussed in Lemma 5.7 (below). The condition (5.11) is 
similar to standard results for LASSO estimators in Ye and Zhang (2010); Bühlmann and 
van de Geer (2011); Huang and Zhang (2012); Huang et al. (2013). The only difference is 
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that the right-hand side of (5.11) usually depends linearly on log(d/e), but here we have 
log?(d/e). The square in the logarithm could be omitted, if we imposed some additional 
restrictions on observation time T in the crucial auxiliary result (Lemma 5.9), where we 
use the Bernstein-type inequality for the Poisson random variable. Obviously, it would 
reduce the applicability of the main result. In our opinion, the gain (having log(d/e) 
instead of log*(d/e)) is “smaller” than the price (additional assumptions), so we do not 
focus on it. 

The next assumption in Theorem 5.5 that TA > 2 is quite natural since observation 
time has to increase when the maximal intensity of transitions decreases. Moreover, 
the conditions (5.8) and (5.9) depend also on parameters of MJP. More precisely, they 
depend on the stationary distribution 7 and the spectral gap pı, which in general decrease 
exponentially with d. However, in some specific cases, it can be proved that they decrease 
polynomially. 

Corollary 5.6 states that the LASSO estimator after thresholding is able to recognize 
the structure of a graph with probability close to 1, if the nonzero coefficients of 6 are 
not too close to zero and the threshold 6 is appropriately chosen. However, Corollary 5.6 
does not give a way of choosing the threshold 6, because both endpoints of the inter- 
val [R, Bmin/2] are unknown. It is not a surprising fact and has been already observed, 
for instance, in linear models (Ye and Zhang, 2010, Theorem 8). In the experimental 
subsection of this chapter we propose a method of choosing a threshold that relies on 
information criteria. A similar procedure can be found in Pokarowski and Mielniczuk 
(2015); Miasojedow and Rejchel (2018). 


Now we state a lower bound for (5.7) which has an intuitive interpretation. 


Lemma 5.7. Define Ag = >> >> > exp (—8w,,(u)) - Then for every E > 1 we 
wEV s/#s up” y (u)40 


have F(E) > (4p). 


Notice that the term Ag is larger, and in turn F'(€) is smaller, when negative coefficients 
of 8 dominate” in the absolute value the positive ones. Note that, the more these negative 
coefficients dominate the more our process ,,gets stuck”, i.e. tends to stay in the same state 
because intensities in this case tend to be close to zero (recall (5.2)). Such behaviour in 
the context of MJPs is natural, because multiplying the intensity matrix Q by a constant «K 
is equivalent to considering time T/« instead of T. While F(€) appears in the lower 
bound (5.11) on T, such dependence on £ is expected. 


5.3 Proofs of the main results 


This subsection contains the proofs of all the statements made in the previous subsec- 
tion. The proofs of the theorem and the corollary are based on a number of auxiliary 
results. Some of these results are well-known facts for LASSO estimators and some of 
them are new (Lemmas 5.9 and 5.11). The main novelty and difficulty of the considered 


model is the continuous time nature of the observed phenomena which we investigate. 
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In Lemma 5.9 we derive the new concentration inequality for MJPs based on the mar- 
tingale theory. In Lemma 5.11 we give new upper bounds on the occupation time for 
MJPs. 

In the proofs of subsequent results we use the first and second derivatives of Z given 
by (5.4), which can be also expressed in the following form 


1 
(0) = = SD oes), (5.12) 
wEV s#s! 


where 


C8 (Oss) = D [—nu(c; 8, sr Zul(c) + tulc; 8) exp(6,,' Zw(c))| . 


cEX-w 


Therefore, we can calculate partial derivatives 


VER (OS st) = ` [—nw(c; s, 8’) + tulc; 8) exp (0y Zw(c))|Zw(c). (5.13) 


cEX_y 


By Remark 5.3 the matrix 6 of all parameters has 2d rows and (d — 1) columns. It can 


be also considered as a 2d(d — 1)-dimensional vector 


pe eh ae ee a 
where (w1, W2,..., Wq) is a fixed order of the nodes of the graph. Using this order we 
obtain the following representation of the gradient of £. 


VO) = = [Vee (8%) (5.14) 


weV,sés! ` 


Analogously we calculate second derivatives 


Vee (Ory) = S tu(c;s) expe 4." Zu(c))Zw(e)Zw(e)". 


cEX_w 


w 
s,s! 


The second derivative of (6) consists of matrices +V70 
R2d(d-1 


(0%) along its diagonal and 


zeroes elsewhere. Moreover, for any vector 0 € ) and the true parameter vector 6 


we have 


67 V20(8)6 = S? S Stolc: s) (02,7 Zale) expl pty" Zele). (5.15) 


WwWEV cEX_y SEs 


Next we provide an auxiliary proposition needed to prove an important concentration 


inequality for MJPs in Lemma 5.9. 


Proposition 5.8. Let X(T) be a Markov jump process with a bounded intensity matriz Q. 
Let 


Ngy = > ni (c s,s") 


cEX_w: Zw(c)[k]=1 
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be a number of jumps from s to s on the interval [0,7] and t, be an occupation time at 
state s on the interval [0,7]. Then 


M,(T) = n; y — t3Q(s, 8’) 


is a martingale with respect to the natural filtration F,. The notation M, (T) means that 
the distribution at time 0 is v. 


Proof. For any u < T we have 


E(M.(r) | Fu) = ML(u) + E(M, (7) — ML(u) | Fa) = 
= M,(u) + E(M,(r) — M, (u) | X(u)) 
= M,(u) +E(Mx (T — u) | X(u)), 


where the last equality is the consequence of Proposition 20.3 from Bass (2011). Now it 


is enough to show that for all r > 0 and all initial measures v we have EM,(7) = 0, since 
it implies that E(E(M,(r) | F,,)) = 0 for u < 7. 
For any n € N define the sequence k; = kj(n) = 7 for all i = 0,...,n. Since 


the trajectory of the process is cadlag, we have 


EM,(r) = zlim $ | (X (ki) = 8, X (ki) = 8’) — ZQ(8, 5E (X (kia) = 8)| 


We observe that for all n € N 


n 


DO [EX Bia) = 5, X (ki) = 8") - =Q(s, s) (X (ki) = 8)| 


i=1 


< N(T) +7, (5.16) 


where N(r) is the total number of jumps. Since N (rT) is a Poisson process with a bounded 
intensity, the right-hand side of (5.16) is integrable and by the dominated convergence 
theorem and the definition of Q we get 


i=1 


Next for s Æ s’ we have 


and for o Æ s 


Hence, 


E(X (ki) = 8, X (ki) = 5") | X(Ki-a)] = (ZQ(s, 8°) + o(1/n)) U(X (ki) = 8). 
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Therefore, we further obtain 


EM,(r) = lim E> | (= Q(s, 8!) + o(1/n)) ieaS “oga (X(ki_1) = 8) 


noo 


i=1 
n 


= lim EJ o(1/n)I(X(ki-1) = 8) =0, 


noo s 
i=1 


where o(1/n) does not depend on i. 


Lemma 5.9. Lete € (0,1) and € > 1 be arbitrary. Assume that TA > 2 and 


1 A 
i2 2 = - og(K/e) S, 


where K = 2(2 + e°)d(d — 1). Then we have 


€-1 
p (IIe < E=) sies 


Proof. Note that by (5.2), (5.13) and (5.14) we have the following inequality 


1 
IVE) lloo SF yey BE pea S els, 8") — tule; s)Qu(e;s, S], 


cEX_w: Zw(c)[k]=1 


where Z,,(c)[k] is the k-th coordinate of Z,,(c) for each w € V, c € X_,. The core step of 
the proof is to show that for fixed w E€ V, s # s', 1 < k < d — 1 and 7 = 2log(K/e) 


P D Melas, 8!) tula 8) Qu(c;s,8")]] > nVTA } < 2+ e)exp(-2). 
cEX_w: Zw(c)[k]=1 
(5.17) 
Having (5.17) we finish the proof of Lemma 5.9 using union bounds. More precisely, 
[=i ) 
P | [VE] > 1A] < 
(1v > EA) < 


IA 


1 [A 
<P] = m ` . A -s)Q : 1 = 
— T weV eee 1ek<d-1 AG S, S ) tule s) ale S, S )] > n T 


cEX_w: Zw(c)[k]=1 


< 2d(d — 1)P ` [nw(c; s,s’) — tulc; s)Qu(c s,s’)]| > nVTA] < 


cEX_w: Zw(c)[k]=1 
< 2d(d — 1)(2 + e°) exp(—log(K/e)) = e. 


Therefore, we focus on proving (5.17). The proof of this inequality is based on the mar- 
tingale arguments, so we make the dependence on the time explicit in (5.17), that is 
Ny(c; s,s’) and tulc; s) become n? (c; s,s’) and t? (c; s), respectively. 


For 7 € [0, T] we define a process 


M(T) = >D [ni(G 8,8") — tulc 8)Qu(c s, 8°)]. (5.18) 


cEX-w: Zw(c)[k]=1 
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We use the upper index 7 in n7,(c;s,s’) and t7,(c;s) to indicate that these quantities 
correspond to the time interval [0,7]. Using Proposition 5.8 to each summand in (5.18) 
we obtain that the process {M(r): 7 € [0,7]} is a martingale. Let us define its jumps by 


AM(r) = M(r) — M(7_) = >D [X(7_) = (s,€), X(T) = (8',0)], 


cEX_w: Zw(c)[k]=1 


where M(r_) is the left limit at 7. By Theorem II.37 of Protter (2005) and Theorem 
1.4.61 of Jacod and Shiryaev (2003) for any x > —1 the process 


E,(7) = exp(#M(r)) [[a + x2AM(u)) exp(—zAM(u)) 


= exp{M(r) — (x — log(1 + ene ab 


is a local martingale, where n7 y = J eex ,: Zw (0)[KJ=1 na (c; s,s’) is computed for a trajec- 
tory at the time interval [0,7]. Therefore, by Markov inequality together with the triangle 
inequality we get for any x € (0, 1] 


P(|M(T)|>L) < P(|zM(T) — (x —log(14 2))n7,,| > 2L/2) + 


s,s! 


+ P((a—log(1+2))nj y > L/2) < 


IA 


2exp (=) + P((x — log(1 + £))}n? y > xL/2). 


We observe that i, is bounded from above by the total number of jumps up to time T, 
which in turn is bounded by a Poisson random variable N(T) with the intensity TA. 
Hence, again by Markov inequality we have 


P((x —log(1+ 2))n?,, > xL/2) < exp = +TA ( -= \)] 
2 Lz 


Applying an inequality e” < 1/(1— x) for x < 1 and setting x = 1/V/TA we get 


-L TA 
P((a =bg +r)n: s > 2L/2) < | i 
(E= 10g +2) ny > 21/2) < exp (Se + ma 


We use TA > 2 and we plug in L = nVTA to conclude the proof. 


The next lemma is a direct application of Theorem 3.4 of Lezaud (1998) and it will 
be used in the second crucial auxiliary Lemma 5.11. 


Lemma 5.10. For any w E V, 8 E€ Xw, Cs, E Xs, we have 


16 + 207(cg,,,0;8) ] ` 


Proof. Fix w E€ V, s E€ Xu, Cs, E Æsa- By the definition we have 


1 
p (Fieles 0; Jarat 5/2) siik (- 


tC, 0; 5) = / X(t) = (csu, 0; s)| dt. 


Let us define f(X(t)) = m(cg,,,0; 5) — I(X(t) = (cg,,,0,s)). Taking y = T(cs„, 0; s)/2 in 
Theorem 3.4 of Lezaud (1998), we conclude the proof. 
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Lemma 5.11. Lete € (0,1), € > 1 be arbitrary. Suppose that F(E) defined in (5.7) is 


positive and 


6 (23 ja i) log 2 + log (dlvlla/e) 


T > (5.19) 
: 2 7 , 
pi min, 7(cs,,0;s) 
CSy EX Sy 
then 
P(E = GF(€)) 21-<, 
where Co = mo N T(Cg,,, 0; 8)/2. 
CS EX Sy 


Proof. By the definition of F(€), the equation (5.7) and the formula for Hessian of £ (see 
(5.15)) we have 


1 
—— > — min tw(Csa, 0; 8). (5.20) 
F(E) T WEV,8,CSw EX Sy 


We complete the proof by bounding the right-hand side of (5.20) from below. First, we 
can calculate that 


1 
min —tyl(csg,0;5s) > > 
(suck cay T w(Csw: 0; 8) = o) 2 


1 
> P (Vuevsetuesets, ptulcsw, 0; s) = T (CSu 0; s)/2) = (5.21) 


1 
>1-2d max gi Sulp (Files, 0; s) < m(cg,,, 0; 5/2) . 


WwWEV,sEXw,CSy EXSw 


Using Lemma 5.10 we bound (5.21) from below by 


1 — 2d max giSel ||v||2 exp ( 


WEY, se Xw Cs, EXSw 


16 + 207(cg,,,0;8) ] ` 


Applying (5.19) we conclude the proof. 


Next we state and prove three lemmas, where Lemmas 5.12 and 5.14 will be used in 


the proof of the main result Theorem 5.5 and Lemma 5.13 is needed to prove Lemma 5.14. 


Lemma 5.12. Let 8 = 6 — B, z* = ||VE(8)loo. Then 


(A= 2)|[Bsell < BT [VAA - Ve(B)] + A= 2) [seh < A+2)slh- (5-22 


Besides, for arbitrary € > 1 on the event 
cal \ 
Qi = 4 ||Ve co í — À 
1= AIVO < E 


the random vector B belongs to the cone C(E, S). 


The proof of Lemma 5.12 is similar to the proof of Lemma 3.1 of Huang et al. (2013) 
and is based on convexity of (0) and properties of the LASSO penalty. For convenience 


of the reader we will provide it here using our notation. 
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Proof. Since (0) is a convex function, then 
BT | ves) - ve(s)| = BT [vee + 8) - ve(s)| > 0, 


which instantly proves the left-hand side inequality in (5.22). As we have already men- 
tioned, the minimized target function, given in (5.5), is also convex because of the convex- 
ity of the negative log-likelihood (0) and ¢;-penalty functions. Hence B will be a minimizer 
in (5.5) if and only if the following conditions are met 


FN) L A sen(8,), if By £0. 
oe (5.23) 
ONE) 2 5 if ô =0 | 
dB; ZIN J ’ 
where j € {1,...,2d(d — 1)}. First, we can write 
a [va +A) Vea] = AME ae ae B+) arve): 
j j 


jEese 


Since 5; = Ê; for j € S°, then applying the conditions (5.23) we can bound the last 


expression from above by 


NO ÊA sga(ĝ;)) + XC BIA + Blliz* = XO ABI + Iõslhà + 2*llőslla + 2*llĒselh 


JES jes jese 


1 + (àA + 2*)l|ősl|ı meaning that the right-hand side 
aeb) _ 
A ~ A 06; A 

the set S° N {j : B; # 0}, because 8; = 8; — 8; = 0, when j € S° and 8; = 0. Finally, 
by (5.22) and the definition of Q4 we obtain 


This in turn equals to (z* — A)||Gge 


inequality holds as well. Note that we used the fact that —X sgn(G;) only on 


\|Bsella < < |Bsll1 
proving the last claim of the lemma. 
Lemma 5.13. For any b € RPD) we define œ = max exp ees" al Zw(c)|). 


wey, s#s’, cEX_w 
Then we have 


cz bT V20(B)b < b"[VE(B +b) — VE(B)] < eb" V2E(B)b (5.24) 


and 


c 'V70(B) < VUB +b) < &V72(8), (5.25) 


where for two symmetric matrices A, B the expression A < B means that B — A is a non- 
negative definite matriz. 
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Proof. First we prove the inequality (5.24). By (5.14) we have 
b' [VEB +b) — Ve(s)| = 
= 2 E Y tolas) (7 Zul) exp (Ge Zol) [plt Ze(e)) 1]. (6.26) 


WEY cEX_w SAS 
Moreover, as in (5.15), we have 
1 2 
Tyw2 = : w T w T 
bTVU(B)b = = yD Ge) Ee Ao) epee Zale): (5.27) 
WEV cEX_y SES 
Let us consider an arbitrary summand in (5.26) and the corresponding one in (5.27). We 


can focus only on cases where t,,(c;s) > 0 and bY,,'Z,(c) # 0. From the mean value 


s,s! 


theorem we obtain for all non-zero x € R 


e 
ŽO < ll, 5.28 
ao = (5.28) 


el < 


So using (5.28) we can write 


expt" Zu(0)) = 1 
be | Ze) 


s,s! 


exp(—|b%y | Zw(e)|) < < exp(|bS s Zw(c)])- 


Finally, we multiply each side by the expression tw(c; r Zar epit Zalo) to 
conclude the proof of (5.24). Similarly we can prove (5.25). Finally, for any vector 


x € R24!) we have 


TUPAS D YL telas) (124 Zul)” exp (Ge Zel) 


br cEX_y s/s 


and 
a! VUB + b)x D9 5 So Gs) (xy Zolc)) exp(Bey' Zole) exp hy Zw(c)). 
L cEX_w SAS 


Comparing each summand separately and taking into account the definition of c, and 


the inequality (5.28) we finish the proof. 


Lemma 5.14. Let € > 1 be arbitrary and assume that F(€) > 0. Moreover, let us denote 


E e and the event Qz = {r < e™'}. Then Qı NQ C A, where 
i 2Ee"A l 
A= — oœ > FE IDeA 
fi- ale < AE 


andn < 1 is the smaller solution of the equation ne™” = T. 


Proof. The proof is similar to Theorem 3.1 of Huang et al. (2013) or Lemma 6 of Miaso- 
jedow and Rejchel (2018). Suppose we are on the event Qı N Qə. Denote again B= B- B, 


so by the previous lemma we have 0 = E€ C(€,S). Let us consider the function 


4 
lla 
g(t) = 0'Ve(8+t0) —O'Ve(B), t>0. 
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This function is non-decreasing, because the negative log-likelihood function is convex. 
Hence, for every t € (0, ||3||1) we have g(t) < g(||G||,). By Lemma 5.12 on the event Q, 


we obtain 


2AE 
F 
IT [VAB +18) = VUB) + sela < E Osl (5.29) 
Using Lemma 5.13 with b = t0 and (5.24) we obtain 
to" [VEB + t0) — Ve(B)| > P exp(—t)0' V7E(B)8, (5.30) 


in this case cy = C < exp(t). Now using the definition of CIF F(€), the fact that 0 
belongs to the cone C(€,S) and applying the bounds (5.29), (5.30) we get 


i < oe < 6" [VEB + t0) — VE(B)] < 
< DAE 
~ Sit 1 


= 2A |||. — Ta ACE + 1)|lðs|l1/2- 


This means that for any t satisfying (5.29) we have 
(E+ DISIA _ 
2F(§) 


Since, as we mentioned, the function g(t) is non-decreasing, the set of all non-negative t 


F 
t exp(—t) 


rare Pel- alls js 


texp(—t) < (5.31) 


satisfying (5.29) is a closed interval [0, t] for some t > 0. Hence, (5.31) implies t < n, 
where 77 is the smallest solution of the equation ne” = r. Now from (5.29) and (5.30) we 
obtain 
e" < tet < texp(—t)0" VLB) < a" [VE + t0) — Ve(8)] < 
P(E) |r| lloll F(€)|Orl|1[1llo0 
2NE 2N€ (Bll 
~ (EF DFOMAle E+ DFO 


which finishes the proof. 


Proof of Theorem 5.5. Fix arbitrary € > 0 and € > 1. Then F(E) is positive by Lemma 5.7. 
Thus from Lemma 5.11 we know that P (F(€) > G@oF(€)) > 1 —e. Using it with the right- 
hand side of (5.9) we obtain that P(Q2) > 1 — e. Moreover, from Lemma 5.9 we have 
that P(Q,) > 1 — £. Therefore, Lemmas 5.12 and 5.14 (with 7 = 1 for simplicity) imply 


the inequality 
a 2€er ) 
P — blo < —— ] © 1-2¢. 
(1 AR FO 


Finally, we bound F'(£) from below by oF (£). 


Proof of Corollary 5.6. The proof is a simple consequence of the uniform bound (5.10) 
obtained in Theorem 5.5. Indeed, for arbitrary w € V, s # s’ and j-th coordinate of 
the vector 6Y,, such that 6%, (j) = 0 we obtain 


s,s! \J 


Be =i =pl E = Al < 6 
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Analogously, for each w € V, s # s’ and j-th coordinate such that 8#, (j) 4 0 we have 
G D| > zs) = êL (i) — ssj | > Bmin — IÊ — Bllo > 2ô — R > ô, 


which concludes the proof. 


Proof of Lemma 5.7. Fix € > 1. For each w and cs„ we have Zw(cs„,0) = (csa, 0), so 


exp (Crna) l 
PE) E T pap aap > Test: Naess 


WEV S'ES cs, EX s,, 


where (62y) g, and (83:51) 5, 


respectively. Therefore, we need to bound from below the expression 


DE DY exp ((6%y)8,¢5.) [C%)5,, C50] 


WEV S'ES CSu EXSw 
ZS Fal ales 


for each 6 € C(£, S) and 0 ¥ 0. First, we restrict the third sum in the numerator of (5.32) 


to the summands corresponding only to vectors e, € Æs„ having 1 on the coordinate 


are restrictions of 6Y,, and @¥., to coordinates from Sw, 


(5.32) 


corresponding to the node u € S, and 0 elsewhere. Then we reduce the numerator 
of (5.32) to the following form 


ye Hp exp (b2 (u)) [02y (u ie (5.33) 
wey s'/#As uCSw 
Recall that Su = {u € —w : pY (u) #0 or Bi'o(u) #0}. Therefore, if 3%”,,(u) #0, then 
u E€ Sw, so the sum (5.33) can be bounded from below by 
` ` ` exp (82 (u)) [oe]? f (5.34) 


wEV s'#s u:B™ ,(u)#0 


because (5.33) has more summands and the summands are nonnegative. Using reverse 
Hölder’s inequality we replace (5.34) by 
2 


Az" es > y g(t | ’ (5.35) 


wEV s'#s u:B™ ,(u)#0 


A4=) > 2 exp (—6s s (u)) - 


wEV s'#s u:B® , (u)40 


where Ag is 


Next, recall that S is the set of nonzero coordinates of 8, so (5.35) is just ||As5||?/Ag. 
Summarizing, the sum (5.32) is bounded from below by 
shi 
Agl lle 
for each 0 € C(€,S) and 0 Æ 0. The vector 0 belongs to the cone C(E, S), which implies 
that ||ðsell < ||Asella < €|J@s||1 and 


(5.36) 


œ) < max(|lðs|lı, Sllðslla), 
which gives us ||8|| < £llôs||ı. Applying it in (5.36), we finish the proof. 


IIFlloo = max(||As]loo; [|As« 
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5.4 Numerical examples 


In this section we describe the details of algorithm implementation as well as the results 


of experimental studies. 


5.4.1 Details of implementation 


We provide in details practical implementation of the proposed algorithm. The solution 
of (5.5) depends on the choice of À. Finding the ,,optimal” parameter À and the threshold 6 
is difficult in practice. Here we solve it using the information criteria (Xue et al., 2012; 
Pokarowski and Mielniczuk, 2015; Miasojedow and Rejchel, 2018). 


First, using (5.12) we write the minimized function in (5.5) as the sum 


— All], = nl —r OZ y 

08) — Allô = EE (peers Z wiv). 

where s,s’ € {0,1}. Therefore, for fixed w € V and s,s’ € {0,1} with s # s’, the cor- 
responding summand is a function which depends on the vector 0 restricted only to its 
coordinate vector 0%, (see notation (5.1)). So, for each triple w and s # s’ we can 
solve the problem separately. In our implementation we use the following scheme. We 
start with computing a sequence of minimizers on the grid, i.e. for any triple w € V, 
s #s' we create a finite sequence {A;}*_, uniformly spaced on the log scale, starting from 
the largest A;, which corresponds to the empty model. Next, for each value A; we compute 
the estimator ae li] of the vector BY, 


B= argmin {02a +All}, (5.37) 
where as in (5.12) 
w w 1 gpw T w T 
ss A) z T 2 [-nw(¢; S, S Oe Zw(C) a ty (C; s) exp (02y Z) $ 


The notation 6 i] should not be confused with 8% (u) introduced before. Namely, 


Be alt li] is the i-th approximation of BY,, while 8Y,,(u) is the coordinate of 3%, corres- 


ponding to the node u. To solve (5.37) numerically for a given A; we use the FISTA 
algorithm with backtracking from Beck and Teboulle (2009). The final LASSO estimator 


pular method of choosing the value of A in the literature (Xue et al., 2012; Miasojedow 
and Rejchel, 2018), i.e 


li*] is chosen using the Bayesian Information Criterion (BIC), which is a po- 


i* = argmin {nee (Bey lil) + log(n)||62, illo} . 


1<%<100 


Here [ie li]llo denotes the number of non-zero elements of 5” [i] and n is the number of 


observed jumps of the process. In our simulations we use N = 100. 
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Finally, the threshold ô is obtained using the Generalized Information Criterion (GIC). 
The similar way of choosing a threshold was used previously in Pokarowski and Mielniczuk 
(2015); Miasojedow and Rejchel (2018). For a prespecified sequence of thresholds 2 we 
calculate 


§* = argmin { nl (808) + log(2d(d — 1)) 1182 lo} 
ED 


where Bue is the LASSO estimator By, after thresholding with the level ô. 


s,s! 


5.4.2 Simulated data 


We consider three models defined as follows. For shortness we denote these models later 
on as M1, M2 and M3, respectively. 


Model 1. All vertices have the “chain structure”, i.e. for any node, except for the first 
one, its set of parents contains only a previous node. Namely, we put V = {1,...,d} 
and pa(k) = {k — 1}, if k > 1 and pa(1) = Ø. We construct CIM in the following way. 
For the first node the intensities of leaving both states are equal to 5. For the rest of 
the nodes k = 2,...,d, we choose randomly a € {0,1} and we define Q;(c, s,s’) = 9, if 
s # |c—a| and 1 otherwise. In other words, we choose randomly whether the node prefers 
to be at the same state as its parent (a = 0) or not (a = 1). Say, the node k prefers to 
be at the same state as the node k — 1. Then if these two states coincide, the intensity 
of leaving the current state is 1, otherwise it is 9. The intensity is defined analogously, 
when the node k does not prefer to be at the same state as the node k — 1. 


Model 2. The first 5 vertices are correlated, while the remaining vertices are independent. 
We sample 10 arrows between first 5 nodes by choosing randomly 2 parents for each node. 
In order to define the intensity matrix, consider the node w € V with pa(w) 4 @ and 
a configuration c € Xpa(w) of the parents states. We denote |c| = 1 if all the parents of w 


are in the state 1, and |c| = 0 otherwise. Next we define intensities as follows 


5 if pa(w) = 9, 

9 if pa(w) 490, s is preferred state and |c| = 1, 
Qu(c, s,s’) = 41 if pa(w) 490, s is preferred state and |c| = 0, 

9 if pa(w) 4 Q, s is not preferred state and |c| = 0, 

1 if pa(w) 4 Q, sis not preferred state and |c] = 1. 


As in the previous model, the preferred state is chosen randomly from {0,1}. In words, 
for every node w with pa(w) 4 @ we choose randomly one state, say 0. In this case, if all 
parents are 1 the process prefers to be in 1 and if some of the parents are 0 the process 


prefers to be in 0. 


Model 3. All vertices have a „binary tree” structure with arrows from leaves to the root. 
So, leaves have no parents, while the inner nodes have two parents, with the exception 


that one node has only a single parent, if d is even. If the node has no parents or its 
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parents have different states, then the intensity of leaving a state is 5. Otherwise, if a node 
has only one parent or both parents are in the same state, then the intensity of leaving 
a state are computed as in Model 1. 


The model M1 has a simple structure, which involves all vertices and satisfies our 
assumption (5.2). The model M2 corresponds to a dense structure on a small subset of 
vertices and does not satisfy assumption (5.2). Another potential difficulty is related to 
possible feedback loops, which are usually hard to recognize. Therefore, we also consider 
model M2+, which looks like M2, but contains the interaction terms and fulfills (5.2). 
The model M3 has slightly more complex structure than M1, but it also satisfies our 
assumption (5.2). 

We consider two cases: d = 20 and d = 50 for all four models. So, the considered 
number of possible parameters of the model (the size of 3) is 2d? = 800 or 5000, respec- 
tively. For model with interactions, number of possible parameters is d?(d + 1) = 8400 
or 127500. We use T = 10 and 50 for all models and we replicate simulations 100 times 
for each scenario. In Table 5.1 we present averaged results of the simulations in terms of 


three quality measures 
e power, which is a proportion of correctly selected edges; 


e false discovery rate (FDR), which is a fraction of incorrectly selected edges 
among all selected edges; 


e model dimension (MD), which is a number of selected edges. 


We observe that in the models M1 and M3 the results of experiments confirm that 
the proposed method works in a satisfactory way. For T = 10 the algorithm has high 
power and its FDR is not large. The final model selected by our procedure is slightly 
too small (it does not discover a few existing edges). When we increase observation time 
(T = 50), then our estimator behaves almost perfectly. 

The model M2 is much more difficult and this fact has a direct impact on simulation 
results. Namely, for T = 10 the power of the algorithm is relatively low with FDR 
also being rather small. The procedure performs slightly better when we take T = 50. 
However, for both observation times the estimator cannot find the true edges in the graph. 
One of the reasons of such behaviour is that in M2 the dependence structure in CIM 
is not additive in parents. This fact combined with possible feedback loops leads to 
recovering existing edges, but having the opposite to the true ones directions. Looking 
deeper into the results for a few examples chosen from our experiments we confirm this 
claim, i.e. the edges between nodes are correctly selected, but their directions are wrong. 
Therefore, we can conclude that in the complex model M2 our estimator seems at least 
to be able to recognize interactions between nodes, which is important in many practical 
problems on its own. The results for M2+ confirm that the performance of our method 


increases, when we consider more complex parametrization with interaction terms. 
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Table 5.1: Results for simulated data. In the model M1 and M3 the true dimension is 19 
for d = 20 and 49 for d = 50. In the model M2 the true model dimension is 10. 


Model d T Power FDR MD 


M1 20 10 0.86 0.03 16.8 


50 1 0.02 19.3 
50 10 0.61 0.01 30.3 
50 1 0.01 49.3 


M2 20 10 0.16 0.2 2 
50 0.78 0.04 8.1 

50 10 0.10 0.15 1.28 

50 0.62 0.02 6.4 


M2+ 20 10 0.35 0.1 39 
50 0.9 0.02 92 

50 10 0.17 0.08 2 

50 0.68 0.01 6.9 


M3 20 10 0.17 01 3.7 
50 0.97 0.01 18.7 

50 10 0.6 0.09 3.2 

50 0.88 0.003 43 


5.5 Extension of the results 


In this chapter we proposed the method for structure learning of CTBNs. We confirmed 
the good performance of our method both theoretically and experimentally. To simplify 
the notation and help the reader to follow our reasoning we restricted ourselves to graphs 
with only two possible states for each node. However, our results can be generalized 
in a straightforward way to any finite graphs by extending to other possible jumps 
and possible values of parents. In terms of the explanatory variable, it is equivalent to 
the standard encoding of qualitative variables in linear or generalized linear models. To 
demonstrate the generalization more clearly we present an example similar to Example 5.1 
presented in the very beginning of this chapter. 

Example 5.15. We consider CTBN with three nodes A, B and C. Let their state spaces 
be ¥4 = {0,1,2}, Xg = {0,1,2,3} and X% = {0,1,2}, respectively. Then for the node A 
we define the function Z4 as 


Za(b,c) = [1, (b= 1), (b = 2), (b = 3), (c= 1), (c= 2)" 


for each b € {0,1,2,3} and c € {0,1,2}. Analogously, we can define representations for 


the remaining nodes: 


Z(a,c) = [1, I(a = 1), I(a = 2), I(c = 1), We = 2)|' 


for each a € {0,1,2} and c € {0,1,2} and 


Zo(a,b) = [1, (a= 1), (a = 2), (b= 1), (b = 2), (b=3)]" 


for each a € {0,1,2} and b € {0, 1, 2,3}. 

Therefore, for each node w and for each configuration of parents’ states (e.g. for the 
node A and values in the nodes B and C) the value of the function Z,,(-,-) is still a binary 
vector with the dimension equal to the sum of the numbers of states in all parents nodes 
with subtracted number of nodes and added 2. In this example the parameter vector (5.1) 
is defined as 


A pA pA pA QA BA QB QB QB B oB QB QC cC\T 
p — (2643 Bio Bo.2, Peo, Pie, b21» Bors Bros Boe, ca , b315 Bo 35 P30, Bos ia Bea) : 


With a slight abuse of notation, the vector oi is given as 


Be, = [64,(1), 24 (B = 1), b4 (B = 2), 63,(B = 3), 6&4 (C = 1), BA, (C = 2)]", 


and we interpret it as follows: if all 6!,(B = 1), 894(B = 2) and 66 (B = 3) are equal 
to 0, then the intensity of the change from the state 0 to 1 at the node A does not 
depend on the state at the node B. Similarly, the coordinates {3!,(C = 1) and 66 (C = 2) 
describe the dependence between the above intensity and the state at the node C, and 
Beh (1) corresponds to the intercept. For the node B the coordinates of the vector bP, E 
[C23(1), 8B (A = 1), 8P (A = 2), 8P, (C = 1), BP3(C = 2)] describe the relation between 
the intensity of the jump from the state 1 to 3 at the node B to the intercept, states at 
the nodes A and C, respectively. 


Our results can be also easily generalized to the case, where we consider not only 


additive effect in (5.2), but also interactions between parents. Let us again use an example. 
Example 5.16. As previously we consider CTBN with three nodes A, B and C with 
corresponding state spaces ¥4 = {0,1}, ¥g = {0,1,2} and Xo = {0,1}. In a linear 


model we have the following Z,, functions: 


1),1(b = 2), Me = 1)]', 
Zpg(a,c) = [1,I(a = 1), I(c = 1)] , 
1 


for each a,c € {0,1} and b € {0,1,2}. Then after we add pairwise interactions to 
the linear model the functions above take the form 


For models with more nodes we can also take into account more complex interactions. 
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Chapter 6 


Structure learning for CTBNs for 
incomplete data 


In the previous chapter we considered the case when we observe CTBN at each moment of 
time. Under this assumption we introduced a novel method of structure learning. In this 
chapter we show that our method can be adapted to partially observed and noisy data. 
In the case of partial observations we need to introduce the observation and the likelihood 
of the observed data given a hidden trajectory of a process. We can again parametrize 
CIM by (5.2). However, in this case the problem (5.5) becomes more challenging and 
leads to the following two problems. First, the theoretical analysis becomes more chal- 
lenging because the loss function is not convex. Second, the likelihood function can not be 
calculated explicitly, hence, it is difficult to obtain from the computational perspective. 

In our solution we formulate the EM algorithm for this case, where the expectation step 
is standard and concerns the calculation of the expected log-likelihood. The maximization 
step is performed in the same way as for the complete data. Since the density belongs to 
the exponential family, the E-step requires to compute the expected values of sufficient 
statistics, which is done with the MCMC algorithm developed in Rao and Teh (2013). In 
addition, the results from Majewski et al. (2018) combined with Miasojedow and Niemiro 
(2017) are used in the analysis of the Monte Carlo scheme. 


6.1 Introduction and notation 


Let t = (to, t1,.--,t,) with 0 = to < ti < ... < tn and S = (So, S1,..., Sn) describe 
the full trajectory X of the process on the interval [0, T] (t denotes times of jumps, S is 
a skeleton, where S; € X is a state at the moment t;). Let X denote the set of all possible 
trajectories of the process. Let us consider the case when instead of observing the full 
trajectory X we have access only to the partial and noisy data Y with the conditional 
density p(Y | X). More precisely, we assume that Y is represented by the observation 
of X at times 19°,...,¢2°° with the likelihoods g;(S;+) for j = 1,...,k, where 


jt = max{i : t; < i): 
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We assume that 0 < C < gj < Č for 1 < j < k. In this case the full density is given by 
pe(X,Y) = pY | X)ps(X), 


where pg(X) is given by (2.8). Observe that the dependence of pg(X) on 8 is mediated 
through our assumption (5.2), which can be inserted into (2.9). For the clarity of pre- 
sentation we assume that p(Y | X) is known, however, the adaptation of our method to 
the case where p(Y | X) depends also on some unknown parameters is straightforward. 


The negative of the log-likelihood function in this case is given by 


TOE (f px nax) . 


where symbol dX means the summation first over all possible numbers of jumps of the tra- 
jectory X, then over all possible jumps, and the integration with respect to times of jumps. 
More precisely, 


t2 t3 


fiw jax = ae eG f fost (tri Sige Sy dbx ial 


n=0 S1E¥ SnEX 


Again we can define the estimator of the parameter vector (, as previously, 


Ê= argmin {¢(0) + Allêll}. (6.1) 
§ER24(d-1) 

Since we are not able to compute ¢(@) analytically, we need to propose an efficient algo- 
rithm for finding B . One of the efficient algorithms of solving complex optimization prob- 
lems of the form (6.1) is the projected Proximal Gradient Descent (p-PGD) algorithm 
(see for example Beck and Teboulle (2009) and Majewski et al. (2018)). For a closed 
compact convex set K by [[,(a) we denote the projection of a onto K. Then p-PGD is 
defined iteratively by 


Brey = Il. (prox,, aj), (Bre — WV L(G) , (6.2) 


where {7 $50 is a sequence of step-sizes, and “prox” denotes the proximal operator 


defined for any convex function g by 


. 1 
prox, g(a) = argmin ( g) + =l = el?) ; 
y Y 


In the case of 44 penalty, i.e. g = A|| - ||1, the proximal operator is just a soft-thresholding 


operator. Element-wise soft-thresholding operator S) : R” — R” is defined as 


Sy(xi) = [lz] — A]4 sen(zi), 


where [-], denotes the positive part. 
In our case we are not able to evaluate the gradient V£ explicitly and we will use 
stochastic version of the projected proximal gradient algorithm (p-SPGD), where V£ 
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is replaced by its Monte Carlo approximation. Under the regularity conditions given 
for example in Assumption AD.1 in Douc et al. (2014) we derive that the gradient of 
the negative log-likelihood is given by 
V J pa(X,¥)dX _ 
J Pe (X , Y)dX 


Ve(8) = -V log (Jox vax) = 
_ _ J pa(X,Y)V log(pe(X,Y))dX 


J pe(X,Y)dX 


= -f V log(pa(X, Y))pa(X, Y) 
J p(X, Y)dX 


n- (6.3) 


_ f Vlog(ps(X,Y))pa(X | Y)dX = 


= —E(V log(pg(X,Y)) | Y). 


This equation is sometimes referred as Fisher’s identity. Now based on (6.3) we can 


approximate V£ by 


1S ; 
(8, Xk) Se V1 bare 6.4 
(6, , , ) m 2 og (pa( ’ Ys ( ) 
where X',...,X™ is a set of subsequent states of the Markov chain with the stationary 
distribution mg = p(X | Y) x pg(X,Y), where in particular each of X',...,X™ is 
a trajectory of the process X. To generate this Markov chain we will use the procedure 
described below. 


6.2 Sampling the Markov chain with Rao and ‘Teh’s 


algorithm 


Consider the set M of all possible intensity matrices of Markov jump processes with 
the state space ¥ equipped with some matrix distance. Therefore, for any Q € M and 
s € X each element Q(s,s) on the diagonal of Q is nonpositive, and otherwise it is non- 
negative. Let £ C M be a compact set and choose n > ap max Q(s), where we used 


the notation Q(s) = —Q(s,s) introduced in Section 2.5. To generate the Markov chain 
parametrized by Q € £ we will use Rao and Teh’s algorithm, which uses the idea of 
uniformization and the notion of virtual jumps (Rao and Teh (2013)). A virtual jump in 
a trajectory described by a pair of time points t and a corresponding skeleton S means 
that for two subsequent time points t; and t;,, we have S; = $;,;. Simply put, it means 
that the process can jump from a certain state back to the same state. Here we provide 
a comprehensive description of a single iteration of the algorithm. Given an arbitrary 
trajectory (t, S) of X, such that Y(t) = X(t%s) for 1 < i < k, we generate another 
trajectory (t, 8) with this property using the following procedure: 


1. We generate times of virtual jumps v from piecewise homogeneous Poisson process 


with the intensity n — Q(X(t)), which means that for every interval [¢;,t;,1) we 
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sample a number k; of virtual jumps from the Poisson distribution with the para- 
meter equal to (7 — Q(S;))(tiz1 —t,); then the times of virtual jumps are uniformly 
distributed on [t,_1, t;). 


2. We add virtual jumps to the trajectory in a correct order and we obtain new times 
of jumps t’ = tUv and a new corresponding skeleton S’ = (S6,..., Sy), where n’ is 
the number of elements of t’. Therefore S’ = (So,...,.So,S1,..-,S1,---,Sn,---;Sn) 
with ko instances of So, kı instances of S4, etc. In other words, this skeleton con- 
tains the same jumps at the same time points as S and for every interval [t;, ti41) 


the states in S’ are equal to Sj_1. 


3. We sample a new skeleton § of the size n’ using the standard Forward-Filtering 
Backward-Sampling algorithm (FFBS), which for completeness is provided at the end 
of this chapter in Section 6.5. Thus, the resulting distribution of the skeleton will be 


v(So) Mia P(S 94) 1 95 Sie") 
Es v(So) Mii P(Si-1, $s) i Se) 


where v is the initial distribution (which does not depend on Q), and 


Gis] if s £s’ 


, 


P(s,s’) = “al ) (6.5) 


jsa ifs = s’, 
1) 


where s,s’ € XY. The summation in the denominator is taken with respect to all 


possible skeletons of size n’ containing virtual jumps. 


4. From the trajectory (t’,S) we discard newly acquired virtual jumps (i.e. we re- 
move S; such that S; = Si) and we obtain a new set t of times of jumps and 
the resulting trajectory (t, 8), which describes the desired Markov chain. 


The procedure above describes one step of the algorithm and Rao and Teh (2013) showed 
that both trajectories (t, S) and (€,S') describe the same MJP. Simply put, the Poisson 
rate 7 dominates the leaving rates of all states of the MJP and the new skeleton will 
contain more events than there are jumps in the MJP path. The corresponding trajectory 
is regarded as a redundant representation of a pure-jump process that always jumps to 
anew state. Note, that our new stochastic matrix P defined in (6.5) allows self-transitions 
(we refer to them as virtual jumps), and as 7 increases their number grows as well. These 
self-transitions will be discarded in the final step of the algorithm, which compensates for 
an increased number of events. 

The step of the algorithm is the composition of two Markov kernels. First we add 
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virtual jumps according to the kernel M defined by 


Mè((t, S), (t, S)) = I(T = tU v) 


n—1 Ji+1 
II fia — Q(8:)) (tena — ta) CDt, < via < < Vir < tira) II (Sı = S;) 
i=0 l=ji 


Next we draw a skeleton according to the kernel MB given by 


v(So) Mi P(Si-1, $i) Mi- 95S) 


Mé((é, 8), (E, 5)) = (t = t) = m = a k = ) 
Es {($o) Mi PGi, 5) iag} 


(6.7) 


where n denotes the length of the vector t. The dependence of M3 on Q is hidden in 
the definition of P. 

Note that adding virtual jump times v defines the new skeleton uniquely and we denote 
it by S”. Therefore, since sampling a new skeleton does not change times of jumps, the full 
kernel Mg = MM is given by 


Mo((t, S), (t, 5)) = Mg ((t, S), (E, S”))M3 (Œ, S”), (E, 5)). (6.8) 


The kernel Mg acts on a function f(t, S) as follows 


Mof (t, S) = E [f (E, §) | (t, S)] = 


= > ape F /U {ti < via ee < Vik < titi} X fG, S) Ma((t, S), (t, S))dv. 
i=0 5 


ko=0 kyn—1=0 


where the inside sum is taken with respect to all possible skeletons of the same length 
as t. Further, when it does not cause any confusion, for convenience we often denote any 
single trajectory (t, S) as X, and as X; — the trajectory obtained after i-th iteration of 
the algorithm with the starting trajectory Xo. Then for any two adjacent trajectories X; 
and Xji41 the function Mgf(X;) stands for E[f(Xi+1) | X;].. Moreover, for any trajec- 
tory X let V(X) = V(t, S) = n+ 1 (recall that n denotes the number of jumps on the 
trajectory X described by t and S). 


6.3 Structure learning via penalized maximum likeli- 


hood function 
Recall the assumption (5.2) introduced in the previous chapter 


log(Qu(c, S, 5')) = a Aho) ` 


For a given parameter vector 6 € R?4(¢-)) this defines the intensity matrix Q and this 
pene 


mapping can be regarded as an isometry. So, if 6 belongs to a compact set K € 
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then Q belongs to some compact set £ € M and the construction above is still valid. 
In this case, we will frequently write Mg instead of Mg. Now we introduce the main 
theoretical result regarding the convergence of our algorithm. 


Theorem 6.1. Let K € R?“*) be some compact convex set. Denote as Nx(B) the 


normal cone to the set K at the point 6 


Nx(8) = {a € RD : (a, z — B) for all z € K}. 


Moreover, denote 


S={BEK:0€ VB) +All — Ne}, 
where O||3 ||, denotes the subgradient. Suppose that (€+Al|-||1)(S) has non-empty interior. 


Assume also EV?(Xo) < co. Let the sequence {yk, k E N} satisfy yk > 0, jim Yk = 0, 
—0o 


o0 oO co 
D YS; X [Ye — Yenil < 00. 
k=1 k=1 k=1 


Let {6p} be a sequence generated by the projected stochastic proximal gradient descent as 
in (6.2). Then 


and 


dist(B,, SN K) 22 0 a.s. 
Remark 6.2. 


(1) Obviously, jm Yk = 0 follows from the convergence Xp; Y2 < o0. 


(2) The symbol (£ + A|| - ||1)(S) should be understood as the image of the set S under 
the function h(@) = (0) + Allð lla. 


The theorem is a consequence of Theorem 5.4 of Majewski et al. (2018). It states that 
the sequence of parameter vectors By generated by the projected SPGD algorithm con- 
verges almost surely to a stationary point of the function being minimized, where instead 
of the gradient of the negative log-likelihood we used its Markov chain approximation. 

Before proving the theorem we need a few auxiliary results and some additional no- 
tation. For any function f of the trajectory X and any signed-measure u, we define its 
V-variation by 


lallv = sup |a(f), (6.9) 
IfISV 


where 


wf) = f fdu, 


and the integration is over all possible trajectories of the process. Also we denote 


flv = sup Pt (6.10) 


where supremum is taken with respect to all possible trajectories. 


First we prove three auxiliary lemmas concerning the kernels Mg defined by (6.8). 
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Lemma 6.3. Fix a compact set L E€ M. Then there exist constants pı, p2 € (0,1) and 
bı, b2 < co such that for any trajectory X we have 


sup MgV(X) < piV(X) + by 
QEL 
and 
sup MoV’ (X) < pV?’ (X) + bo. 
QEL 
Proof. The proof is a simple extension of proofs of Lemma 5 and Proposition 6 in Miaso- 
jedow and Niemiro (2017). First we note that in the first step of Rao and Teh’s algorithm 
we do not add any new jumps. This implies that in order to obtain the desired bounds 


on MgV(X) we simply need to bound the expectation EV(X’) of the jumps for the tra- 


jectory X’ obtained on the Step 3. Analogously, instead of bounding MgV?(X) we bound 
EV?(X"). Indeed we have 


MgV*(X) = E[V*(X’) | X] = E[E(V7(X’) | |X" =n' +1) | X] 


and E(|X’| = n’+1|X) <n+1+nT with V(X) =n+1, where |X’| denotes the number 
of states in the trajectory X’. For the trajectory X’ = (t, S”) we have 


n’—1 
V(X’) =1+ X (8) A S1) 
i=0 


Therefore, we get 


n'—1 
V(X") = 1+ 2K bp A 


i=0 


+E X (S; A SaaS, A S41)}- (6-11) 
iF; 

Applying Lemma 2 of Miasojedow and Niemiro (2017) together with the definition (6.5), 
the definition of 7 and assumptions on likelihoods g;(Sj+), for each i = 0,...,n’ — 1 we 
obtain 

P (S; =s | S44=s) > ð; > 0. 
This is a lower bound for the backward transition probability used by the FFBS algorithm. 
An analogous inequality for the forward transition probability is also true. Hence, 


E ( (S; a S34) =E [E( (S; a S141) | Si )] = 
=P (S, Zs |S =8) <1-&, 


(6.12) 


which means that we can bound the second term on the RHS of (6.11) from above 
by 2525 (1 — 6;). Also, for each i + j we have 


(Si A Si)US; Si) < US; A Sia). (6.13) 
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Thus, using (6.12) and (6.13), the third term on the RHS of (6.11) can be bounded by 


n'—1n'-1 n’—1n'-1 
E| XO SUS, Z Sia) | = >> DDE (EWS; 4 Sin) | Sa) = 
es ee? 


j=0 i=0 
tAj 
Combining both bounds we obtain 
n'-1 n’—1n'-1 
BV?(X’) <14+2n'-25) 5 4n'(n'-1)-S SCG K< 
i=0 j=0 i=0, 
iAj 


< (1— ô)(n' +1)? +1< (1-d)(n+1) +6, 


where ô = min{d;} € (0,1) and b is some finite constant. This finishes the proof of 
the second part of the lemma. 

The first inequality can be shown either analogously to the second one or by applying 
Jensen’s inequality. Indeed, 


(MoV (X))? < MaV*(X) < paV*(X) + bə, 
which in turn implies that 


MoV (X) < V/V (X) + b2 < VPV (X) + Vba, 


which concludes the proof. 


Lemma 6.4. If E[V(X0)] < 00, then sup MoV (Xn) < œ. If in addition E[V?(Xo)] < 00, 


n>l 


then sup MoV? (Xn) < 00. 


n>1 
Proof. Recall that X, is the trajectory obtained after n-th iteration of Rao and Teh’s 
algorithm starting from the trajectory Xo. As previously we can consider EV (X,,41) 


instead of MoV (Xn). Hence, by the previous lemma we have 


ELV (Xn41)] = E(E[V(Xn41) | Xn]] = ELMQV (Xn)] < pPEV(Xn) +b, 


where p € (0,1) and b < oo. Then, by iterating this majorization procedure recursively 


we get 
n+1 b 


EIV (Xn < P UEV(X b i < pEV(X a, 
[V(Xn41)] < "HEV (Xo) + dese V(X) + 


Since the RHS of this inequality does not depend on n, then E[V (Xn+1)] is bounded by 
a finite constant. This concludes the proof for the first bound. The second inequality can 


be shown analogously using the bound for supger MgV*(X) in Lemma 6.3. 
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Lemma 6.5. For any compact set L C M there exists C € (0,00) such that for 
any Q,Q € L and all trajectories X we have 


| Ma(X, +) — Mg(X, llv < CVX — QI. 


Proof. For any Q,Q € £ the expression of interest (see the definitions (6.6)-(6.9)) can be 


bounded by the sum of two terms as follows 


sup |Mof(t, S) — Maf (t, S)| = sup |MgMé f(t, S) - MEMZ (t, 9)| 
Ifl<v If|<v 


< sup MEMS f(t, S) — MMZ f(t, s)| + sup EAG S) — MSMif(t, s)| 
IfI<V IFI<SV 


= I, + Tp. 
We can bound I, by 


oo oo n-1 
Dy? De Ji {ti < aa < Vig < tins} S IES 
ko=0 k i=0 5 


n—1=0 


x |Md((t, S), Œ, 8°) M5 ((E, S”), Œ, 5)) - M3((E, S), ES") MEE, S”), Œ, 5))| dv. 


Recall that k; denotes the number of virtual jumps on the interval [t;, t;+1). Since |f| < V 
and for any possible § we have V (E, 5) < 14+n+ 2") ki and > MB ((t, S”), (t, 5)) =1 
S 


(see (6.7)), then we can further bound I, by 


lee) ee) n-1 
soe Ji EE E T 
ko=0 kn—1=0 i=0 


x ( +n+ S x) [mg ((t, S), (E, S”)) — MZ ((t, S), (E, S”))| dv, 


i=0 


(6.14) 


Next, let U denote the set of indices u such that only u-th rows of Q and Q differ. Let 
Qı = Q, Qui = Q and for u € U we define the matrix Quy as Qu with the u-th row 
replaced by the corresponding row of Q. In particular, for u € U we have Q,(8) Æ Qu4i(8) 
for a certain state s and for all states s Æ s we have Q,,(s) = Qu4i(s). Then the expression 
under the integral can be bounded by 


M@ (t, 8), (€,8*)) ~ Mg ((t, 8), € S°))| < 


< J |Ma, ((t, 8), (ES) — Må, (t, 8), Œ S”))|. 


Each term in the last sum can be expressed in the form 
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Mg, ((t, 8), (E, 8°)) — M3, ((t, 8), (E,S”))| = 


ki —(n—Qu(Si)) (tips ti 
= I] (n = Qu(S;)) (tiga — ti)l e70 Qu(Si))(ti41 ti) _ 
i=0 
n—-1 
ki —(n-Qu i) )(tit1—ti) | — 
= (1 — Qur (81) (tiga — ty) | eT O Cen GG) — 
i=0 
n—1 n-1 
= II (ti — a (7 — Qu(Si))* E ERa 
i=0 i=0_ 
S;#Ż5 
2." —(n-Qu(8)) X (ti+1—t:) X ki —(n-Qu+1(8)) X (ti+1—ti) 
= S= = S;= S= 
xin- ouei e a Oh Qul) e Zi! 


Now let x = n= Qu (5), y=- Quy (8), a = 3 ki and b = X (test z ti). Let 
Si=5 S;=8 


also r = max(z,y). Hence, we need to bound the expression |x%e~°” — y*e~"4| for some 


x,y,a,b > 0. From Lagrange’s mean value theorem we have 


d 
zte — y%e | < sup 5, (2% a) lz — y| = sup Jaz* te" — bz a e™]| |x — yl. 
ze (x,y) | 2% z€ (x,y) 
Next we can bound the above supremum by 
sup jaz te — bee ™| < sup max(az* te”, bz%e™). (6.15) 


z€(a,y) z€ (x,y) 


Let us assume first that the first expression of (6.15) is the maximum, then we obtain 
az te? < ai te, (6.16) 


where 7 = min(r, at). In the case where the second expression is the maximum we 


obtain analogously bz*e~” < bñfe™™, where 7 = min(r, +). Using the first assumption 


n—1 
with the corresponding inequality (6.16), the fact that a < ` k; and the fact that the set 


i=0 
of indices U is finite, we can bound I, by 
oo oo n—l1 n-1 
> x. ee ` JII {t; <V << Viki < tis} I] (n _ Q,(S;))" e7 (1 Qu(Si)) (Ci41 ti) 
u ko=0 kn=i1=0 i=0 i=0_ 
S;As 
n-1 n—l1 n—l1 
x [J aie ten ki - +n x) |Qu4i(8) — Qu(8)| dv < 
i=0 i=0 i=0 
S;=5 


<D Qen) E[S (14a 5n)| < 


< UIQ- ÖIGIT + ant + nT + (T)? )= 
= |U|IIQ — QIT (n + 2) + (nT)) = CII — ĜI, 
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where C (n) is a certain linear function of n. 


Using the same technique and the fact that b < T we can bound I, by 


n-1 n-1 
D lQus1(8) - QuE) E (14045 on] <T|U||Q- GI -E (a+ 5n) - 
u i=0 i=0 


= T|UIIIQ — ĜI ++ nT) = Co(n)||Q — ĜI, 


where C3(n) is a certain linear function of n. 
Now we can bound Í, in a similar way as we did for I, by an expression similar 
o (6.14), namely by 


n—-1 
Ss 2 Jil TE ESE 


kn—1=0 i=0 


Ma((t, S), (é S°) > | MSG, S”), (t, 5)) — M(E, S”), Œ, 5))| dv. 
3 


Before we continue, for any possible skeleton S let us denote a few auxiliary functions 


tS 
© 
A 
t 
— 
pas 
M 
T 
= - 


Jl Sk), 


i=1 j=1 


Ro = Lal), Hel) = “Z2. 
S 


Therefore, we need to obtain the bound for 


X |Ho(S) - Ha(S)| = >> 
S 


DEET, Ss 


The initial distribution v and likelihoods g; for all j are the same for different intensity 


Lo(S) Lals) 
re k 


(6.17) 
1 


Re BaN 


matrices Q. Using this fact and “triangle” inequality for two products of positive numbers 


-H sY- me II Yis 


=] i=j+1 
with z; = P(Si—1, Si) and y; = P(Si-1, Si), where P is defined by Q in the same way 
as P defined by Q (see (6.5)), we obtain the inequality 


25 [LolS) = Lal S)|< > Fie ay TTP Pts. ig) I’ S a6 


j=1 i=1 
<“j9- AIOE LPs. = s,) T] Ps. CELTE Q|| -n- [S| 
j=1 Sj; S<j Sy; i=l i=j+1 
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Here we used the assumption that all likelihoods are bounded C < g; < C for 1 <j<k, 
and we locally denoted the number of all possible states of the process as |S|. Note, that 
the sum over all possible skeletons S was divided into three sums: the first one is over 


all possible states of S;, the second - over all possible states of S),...,Sj—-1 and the last 


one is over all possible states Sj41,...,S,. Applying again bounds on the likelihoods we 
easily obtain that for any Q 
1 1 1 
a gunn een 
Ck ~~ Re 7 Ok 


which leads to the fact that the first expression in (6.17) is bounded from above by 
C3(n)||Q — Q]|, where C3(n) is some linear function of n. The second expression in (6.17) 
is bounded by 

1 1 


11 gees |LQ(S) — LalS)| 
Ro Ra 


_ Re Ro 


“k 


Applying two previously obtained inequalities we derive the bound C,(n)||Q — Q||, where 


C,(n) is some linear function of n. Now combining all obtained bounds for I, and Ip we 


conclude the proof. 


Lemma 6.6. For the measurable function V : X > [1,+00) let us denote by 


| Ma(X, +) — Me (X, llv 
V(X) 


Dy (6, 6’) = sup 
X 


the V -variation of the kernels Mg and Mg and let Fo : X + R* be the function such that 
supgex [Flv < 00. Moreover, define 


Fy = X M3 (Fp — 15(Fs)). 
n>0 
Then 
| Mpls — Mg Foly < C{Dy(B8, B) + |F — Følv}. 


Proof. The proof follows the same arguments as the proof of the Lemma 4.2 in Fort et al. 
(2011) in the supplement materials to the paper. In addition, some references to the first 
papers using similar argumentation can be found there. 


First, we use the following decomposition of M f f— ME f for any k>1 
k-1 
MES — MẸ f = > M3 (Mp - Mg) (M377 S - relf) 
j=0 


By the Proposition 7 in Miasojedow and Niemiro (2017) the sets {X : |V(X)| < h} are 
the small sets for any h € R. Therefore, combining it with Lemma (6.3) we have by 
Theorem 9 in Roberts and Rosenthal (2004) that there exist constants Cg and pg € (0,1) 
such that 


IMEX, -) — relv < Cooh V (X). 
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This property is called geometric ergodicity of the kernel Mg with invariant distribution rg. 
Hence, for any k > 1 and any trajectory X, 


lrg — Tarllv 
< Its — M$ (Xx lv + ME (Xe) — MBX, lv + MG) — røllv 
< (Caps + Caph) V(X.) 


+ sup (6.18) 


lflv<1 


Mi (Ms — My”) (MEE — no) X: 


j=0 


We can bound each summand from the sum on the RHS by 


sup Mj |(Mj— Ms )(Mj?"f — mo(f))(%))- 


lflv<1 


Now let us denote H = (Mees —g(f)]. Then the expression within the absolute 
value operator is bounded by 


sup sup |(Mg — Mgr)g(X.)| < sup |H|v sup |(Mg — Mg)g(X4)| = 


Iflv<1 |gl<|A| lflv<1 lgl<V 
MEF X\ — rol f(X 
Z N F(X) — relf) Il M(X.) -Mo(Xe,Ilv < 
lflv<1 X V(X) 


< Corp?» Dy(B, PVX). 
Thus the last term in the (6.18) is bounded by 


k—1 


Cg Dy(B,B') X og * MiV(X,) < 
j=0 
k-1 


< Oy Dy(8,6') Yoo? {me(V) + Cseh V(X,)} < 


j=0 


Dy(B, B") (ma(V) + CV (Xx). 


Taking the limit as k + +00 in the first term in (6.18) we obtain 


Cp 1 
Irs = relly < 75y PVB P’) (malV) + CaVX). (6.19) 


Now from (7) in Fort et al. (2011) we have 


n-1 
MpFp — Mp Fe =X X (M3 — rg) (Mp — Mg) (Mg? Fe — re (Fo)) 


n>1 j=0 


— X_{M} (Fo — Fp) — ng (Fo — Fa)} — X. a{ MBF — nre(Fo)}. (6.20) 


n>1 n>1 
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Let us consider the first term. Similarly to the previous step by G we denote the operator 
G= [M3 1 Fs — Tg (F3)]. Then we can bound 


|(M} — 3) (Mg — Mgr) (M3 Fa — te (F3)) (X)| < 
sup | (Mj — rg) (Mg — Mg) g(X)| < 


< 
AIG 
< Oe san |G — mg) (Mg — Mgr) g(X)| < 
a\< 
< |Glv sup | (M3 — 7) h(X)| < 


|A|<||Mg-Mø llv 


< |Gly Dy (8, B^) | (M3 — 773) h(X)| < 


< Cop's? |FalvDv(B, B’) -Coh V (X). 
For the second and third terms in (6.20) we obtain the bounds 
|M (Fo — Fs)(X) — ro (Fø — Fo)| < Capg V (X)|Fo — Fplv 
and 
[mat Mg Fg — Te (Fo) (X) = |r — rg) {Mg Fs — Talo) }(X)] < 

< |lt2 — rø llv|Mg F(X) — te (Fs)lv < (6.21) 
< lirs — rg llvCg pg |Felv. 

Therefore, combining the inequalities (6.19) — (6.21) we get 

CaCg 
(1 — pe) — pg) 
TVIX) Fp — Falv+ 
Cg 
(1 — per) 


Thus, since supgex |F’3|v < œœ, there exists a positive constant Lg g for which we have 


| Ms F(X) — Mg Fg (X)| < |Fs|v Dv (8, BV (X)+ 


+ 


F |Fslv Dv (8, 8) (ma(V) + CsV (X)). 


|Ma F(X) — Mg Fg (X)| < Lg,gV(X)(Dy (8, 8’) + |F — Falv). 


This concludes the proof. 


The proof of the main Theorem 6.1 is based on Theorem 6.7 which is obtained by com- 
bining Theorem 5.4 and Proposition 5.5 of Majewski et al. (2018) with a slight adjustment 


of the notation to our context. For any compact convex set K by 


Nx(z) = {a € R? : la, z — x) for all z € K} 


we denote the normal cone to K at the point x. We consider an open set B € R? and 


functions f,g : B — R. We assume that f is a continuously differentiable function and 


also for all 6 € K its gradient satisfies 
VEE) = f (8, X Jrad) 
X 
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for some probability measure 7g and an integrable function ©(6, X). 
By {8p, k E€ N} we denote the sequence generated by the projected SPGD: 


Be € IL, (proxy, g(Br-1 — Yr®(Bx-1,€e))) , (6.22) 


where &, is a random variable with 7,,_, distribution. Moreover, by {6,,n € N} we denote 
the gradient perturbation sequence defined by 


ôk = ®(Bx_1, r) — VE(Gx-1)- 


Moreover, for any measurable function W : X — [1, +00) recall the definitions of 
llullw and |flw given in (6.9) and (6.10). Then we define W-variation of the kernels Mg 
and Mg, by 
|| Ma(X, :) — Mg (X, )Ilw 

W(X) 


Dw(86, 6’) = sup 
x 
Theorem 6.7. Denote 


S={8 EK :0€ VF(B) + 09(8) — Nx(8)}, 


where Og is a subgradient of g : B > R (see e.g. Rockafellar (1970)). Suppose that the set 
(f+g)(S) has empty interior and sup ||dx|| < oo. We also make the following assumptions. 
keN 


(1) The function g is convex, Lipschitz and bounded from below. 


(2) The sequence of step sizes {Yg} satisfies yg > 0 and limps. Yk = 0 and 
X n=, X w-wa, Soe < oo. 
k=1 k=1 k=1 


(3) There exist constants p € [0,1) and b < co and a measurable function W : X > 
|1, +20) such that 


sup |® (8, -)|lw12 < œ, sup MaW < AW +b. 
PEK BEK 


In addition, for any l € (0,1] there exists C < œ and p E€ (0,1) such that for 
any X EX 
sup [M3 X, ) — malls < CWO). 
& 


(4) The kernels Mg and the stationary distributions mg are locally Lipschitz with respect 
to B, i.e. for any compact set K and any B, B' € K there exists C < œ such that 


a ®(8,-) — 86, wie + Dyrn(B, 8’) < CIB — B'I- 


(5) E[W(&1)] < œ. 


Then the sequence {xp, k E€ N} generated by iterations (6.22) converges to S. 
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Proof of Theorem 6.1. For better transparency of the proof we will use m = 1 in (6.4), 
the generalization of the reasoning to the case of m > 1 is straightforward. In our case 
the role of the function f plays the negative log-likelihood ¢((6) and the function g is the 
¢,-penalty. Both functions satisfy the assumptions of Theorem 6.7. 

Then, by the formula (5.14) for the gradient of the negative log-likelihood function 
and the Fisher identity in (6.3) the function ®(6;, X) in our case takes the form 


P(B, X) = ` ‘> ` [-nw(c; s,s’) + tulc; s) exp(6z’, s Zw(c))] Zwlc), (6.23) 
WEV cEX_w ss! 
where we take 6; from the k-th iteration of the p-SPGD algorithm as the parameter 
vector. Other components such as nw(c; 5,8’), tw(c; s) and Z(c) correspond to a single 
trajectory X of the Markov jump process. Integrating this function over all possible 
trajectories with respect to mg = pg(Y | X) gives us the desired gradient V0(G) of 
the negative log-likelihood. 

In place of the function W in the assumptions of Theorem 6.7 we take the function V?. 
Note that the original Theorem 5.4 of Majewski et al. (2018) on the convergence of 
the algorithm does not use the function W, instead it has an additional assumption on 
the gradient perturbation sequence 


ôk = (8x1, Xk) — VE(Bx-1), KEN. 


That assumption states that the sequence {d,,k € N} can be decomposed as 6, = e? +ré, 
where {ef, k € N} and {r?, k € N} are two sequences satisfying lim, 5 ||r?|| = 0 and the 
series Yra yke? converges. However, Proposition 5.5 in Majewski et al. (2018) implies 
that by introducing Assumptions (3)-(5) we obtain the required decomposition of dx. 
Therefore let us check the rest of the assumptions of Theorem 6.7. 

Assumption (2) on step-sizes is automatically satisfied. First we review Assumption (3) 


with W = V?, which consists of three conditions. The first condition that sup |®(6x, -)|v 
BkEK 
is bounded is easy to check because for any trajectory X the sum of the terms n,,(c; s,s’) 


is bounded by the total number of jumps V(X), the sum of the terms t,,(c; s) is bounded 
by the total observation time T' and vectors 8, come from the compact set K, which 
means that exponent is bounded by some constant. The second condition follows directly 
from Lemma 6.3. The last condition representing geometric ergodicity was shown in 
Lemma 6.6. 

In our setting Assumption (4) takes the form 


Dy (8, 8’) + |®(B,-) — (8, )\v < CIB — L'I 


for some constant C. We obtain it by combining Lemma 6.5 and the trivial fact that 
|O(G,-) — &(5’,-)|v < C||B — 6’|| for some constant C. 

In the course of the proof of the mentioned above decomposition Majewski et al. (2018) 
used the following property of the function W, which needs to be checked as well. For any 


trajectory €, under the assumption EW (£o) < oo there holds sup E[W (&)] < oo. In our 
k>1 
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case we can obtain the same property. Assuming EV?(Xo) < œ by Lemma 6.4 we have 


that sup E[V2(X;,)] < oo. This concludes the proof of the theorem. 
k>1 


6.4 Numerical results 


In this section we describe the details of implementation of the proposed algorithm as 


well as the results of experimental studies. 


6.4.1 Details of implementation 


We provide in details implementation of the proposed algorithm in practice. Recall 
that the optimization problem (6.1) is solved by the iterative algorithm called projected 
stochastic proximal gradient descent given in (6.2). Instead of the gradient of the ne- 
gative log-likelihood Vé(@) we use its MCMC approximation 6(3,X1,...,X’), where 
X',...,X™ is a set of trajectories generated by Rao and Teh’s scheme given in Sec- 
tion 6.2. The solution of (6.1) depends on the choice of A. As we mentioned in previous 
chapters, finding the ,,optimal” parameter À and the threshold 6 is difficult in practice. In 
this case we also solve it using the same information criteria as in Chapter 5, where again 
instead of the gradient of the negative log-likelihood we use its MCMC approximation. 

The function ®(6,X',...,X™) is an average of the functions ®(0, X’) introduced 
n (6.23) (recall that we use the symbol 8 only for the true parameter vector and 0 
otherwise). Now, in the analogous way as we divided the optimization problem (5.5) in 
Subsection 5.4.1 we can divide the current one. Namely, for fixed w € V and s,s’ € {0,1} 
with s Æ s’, the corresponding summand in ®(6, X',..., X™) is a function which depends 
on the vector @ restricted only to its coordinate vector 0%, (see notation (5.1)). So, for 
each triple w and s Æ s’ we can solve the problem separately. Let us denote these 
summands of (6, X',...,X™) as Py (02y). 

Hence, in the current implementation we can use the scheme from Subsection 5.4.1. 
Namely, we start with computing a sequence of minimizers on the grid, i.e. for any triple 
w € V, s Æ 3' we create a finite sequence {\,;}_, uniformly spaced on the log scale, starting 
from the largest \;, which corresponds to the empty model. Next, for each value A; we 
compute the estimator BY y [i] of the vector BY 


s,s! 


che [i | = K argmin {0t ot ( )+ Ài los alla }. (6.24) 


OS y 


The notation 6 ’, [i] means the i-th approximation of BY,,. To solve (6.24) numerically 


for a given A; we use the SPGD algorithm without the poii onto the compact set. 

In practice, the algorithm still converges well so we did not use the projection. The final 

LASSO estimator ĝ” y = Be ’,|t*] is chosen using the Bayesian Information Criterion (BIC) 

applied to the MCMC approximation of the gradient of the negative log-likelihood, i.e. 
i* = argmin {n0t (0!) (Be yi) + lou(n) Bylo 


1<i<N 
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Here || Be li]||o denotes the number of non-zero elements of Be, li] and n is the number of 
jumps in the trajectory generated by Rao and Teh’s algorithm. In our simulations we use 
N = 100. 

Finally, the threshold 6 is obtained using the Generalized Information Criterion (GIC) 
as in Subsection 5.4.1, also applied to the MCMC approximation of the gradient of the ne- 
gative log-likelihood. For a prespecified sequence of thresholds Y we calculate 


§* = argmin {nt ($22) + log(2d(d — 1)) 12S lo} 
ED 


where ? is the LASSO estimator Be after thresholding with the level ô. 


s,s! 


6.4.2 Simulated data 


We consider the chain model analogous to the model M1 in Subsection 5.4.2. All vertices 
have the “chain structure”, i.e. for any node, except for the first one, its set of parents 
contains only a previous node. Namely, we put V = {1,...,d} and pa(k) = {k — 1}, if 
k > 1 and pa(1) = Ø. We construct CIM in the same way as in Subsection 5.4.2. Namely, 
for the first node the intensities of leaving both states are equal to 5. For the rest of 
the nodes k = 2,...,d, we choose randomly a € {0,1} and we define Q;(c, s,s’) = 9, if 
s # |c—a| and 1 otherwise. In other words, we choose randomly whether the node prefers 
to be at the same state as its parent (a = 0) or not (a = 1). 

We consider two cases with the number of nodes equal to d = 5 and d = 10. So, 
the considered number of possible parameters of the model (the size of 8) is 2d? = 50 
or 200, respectively. We use T = 10 for 5 nodes and T = 20 for 10 nodes. We replicate 
simulations 100 times for each scenario. As the partial observation we take 100, 200 and 
400 equally spaced points for 5 nodes and 200, 400 and 800 for 10 nodes. In Figure 6.1 


we present averaged results of the simulations in terms of three quality measures 
e power, which is a proportion of correctly selected edges; 


e false discovery rate (FDR), which is a fraction of incorrectly selected edges 
among all selected edges; 


e true model (TM), which is an indicator whether the algorithm selected the true 


model without any errors. 


In Figure 6.2 we provide the results of simulations for the same models in case of com- 
plete trajectories. We observe that the results of experiments confirm that the proposed 
method works in a satisfactory way. We observe that with increasing number of observa- 
tion points results are close to the ones in case of complete data. The larger the number 
of points the higher the power of the algorithm and tends to 1. The FDR is quite low in 
all cases. For the half simulations in case of 10 nodes and the time T = 20 the algorithm 


discovers the true model when we choose a big enough number of observation points. 
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Figure 6.1: Results of simulations for partially observed data. 
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Figure 6.2: Results of simulations for fully observed data. 
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Number of nodes 
-s5 
10 


—— 20 


6.5 FFBS Algorithm 


For completeness of the proposed scheme we provide the description of the forward- 
filtering backward-sampling algorithm for discrete-time Markov chains taken from Rao 
and Teh (2013) with a slightly changed notation. Earlier references for the FFBS algo- 
rithm can be found there as well. 

Let (So,...,. Sn) be a discrete-time Markov chain with a discrete state space ¥ = 
{1,...,N}. Let P be a transition matrix P(s,s’) = p(Sj41 = S | S; = s). Let v be 
an initial distribution over states at time point 0 and let Y = (Yo,...,Y;,) be a sequence 
of noisy observations with likelihoods g;(s) = p(Y; | Sj = s). Given a set of observations 
Y = (Yo,.--, Yn), FFBS returns an independent posterior sample of the state vector. 

Define a;(s) = p(Yo,-.-, Yj-1,9; = s). From the Markov property, we have the fol- 


lowing recursion: 


aj41(s') = ) a;(s)g;(8)P(s, s’). 


We calculate this for all possible states s’ € ¥ performing a forward pass. At the end of 
the forward pass we obtain the distribution 


bn(8) = gn(8)an(8) = P(Y, Sn = 8) x p(Sp = | Y) 
and sample S,, from it. Next, note that 
p(S;=s| Sj =S, Y) x p(S; =s, Sj} =S, Y) = 
= a;(8)g;(8)P(s, PY -+ <, Yn | Sj = 8!) 
œ a;(s)9;(s)P(s, 8’), 
where the second equality follows from the Markov property. This is also an easy distribu- 


tion to sample from, and the backward pass of FFBS successively samples new elements 


of Markov chain from S,,_; to So. The pseudocode for the algorithm is given below. 


Algorithm 1 The forward-filtering backward-sampling algorithm 


Input: An initial distribution over states v, a transition matrix P, a sequence of noisy 
observations Y = (Yo,..., Yn) with likelihoods g;(s) = p(Y; | Sje = 8). 
Output: A realization of the Markov chain (Sọ,..., Sn) 


Initialize ao(s) = v(s). 
for j =0ton—-1 
Gita) = 2 a;(s)g;(s)P(s,s’) for s € X. 
Sample Sn ~ bn(-), where bn = gn(s)an(s). 
for j =n— 1to0 

Define b;(s) = a;(s)g;(s)P(s, S541); 
f Sample S; ~ b;(-). 


return (So,..., Sn) 
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Chapter 7 


Conclusions and discussion 


In this thesis we explored two types of probabilistic graphical models (PGM): Bayesian 
networks (BN) and continuous time Bayesian networks (CTBN). First, we explained 
the concept of PGMs and the motivation to study them with a few examples of suc- 
cessful applications. Then, we discussed more thoroughly PGMs of interest describing 
the problems within both frameworks and provided necessary preliminaries. In terms of 
contributions we were focused on structure learning, which is one of the most challenging 
tasks in the process of exploring PGMs and is interesting in itself. We also discussed other 
types of problems and reviewed some previously known results concerning these problems 
to provide some context. 

The problem of structure learning for BNs is difficult due to the superexponential 
growth of the space of directed acyclic graphs (DAG) with the number of variables and 
also because the underlying graph needs to be acyclic. We solve this problem by dividing 
it into two tasks. First, we use a known method called partition MCMC to slice the set of 
variables into layers where any variable in any layer can have parents only from the pre- 
vious layers and has at least one parent from the previous adjacent layer. Second, we find 
the arrows using the knowledge about the layers. In the case of continuous data we use 
the assumption that our network is a Gaussian Bayesian network and hence each variable 
is a linear combination of its parents. Thus, we solve the problem of finding arrows by 
finding the non-zero coefficients in the linear combination of all the variables from previ- 
ous layers using Thresholded LASSO estimator. In the case of discrete and binary data 
we use the assumption that probability of each variable being equal to 1 is a sigmoid func- 
tion of a linear combination of its parents. Hence, again we solve the problem of finding 
arrows by finding the non-zero coefficients in the linear combination of all the variables 
from previous layers using Thresholded LASSO estimator for logistic regression. Finally, 
for the discrete data where each variable has a finite state space we use a softmax func- 
tion instead of the sigmoid function. We demonstrated theoretical consistency of LASSO 
and Thresholded LASSO estimators for the continuous model and showed their effec- 
tiveness on the benchmark Bayesian networks of different sizes and structure comparing 
the proposed method to several existing methods for structure learning. 


The problem of structure learning for CTBNs in the case of complete data is also 
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reduced to solving the optimizational problem for the penalized with ¢;-penalty maxi- 
mum likelihood function. We assumed that a conditional intensity of a variable is a linear 
function of the states of its parents, which can be easily extended to a polynomial depen- 
dence. Starting from the full graph we remove irrelevant edges and estimate parameters 
for existing ones simultaneously in case of LASSO estimator. In case of thresholded ver- 
sion of this LASSO estimator we only learn the structure. We proved the consistency of 
the proposed estimators and demonstrated coherence of theoretical results with numerical 
results from simulated data. 

The last problem considered in the thesis was structure learning for CTBNs in the case 
of incomplete data. The optimizational problem takes the same form as for complete 
data but we cannot write the likelihood function explicitly anymore. Instead of the nega- 
tive log-likelihood function we used its Markov chain Monte Carlo approximation, where 
Markov chain was generated using Rao and Teh’s algorithm. The optimizational problem 
itself was solved by projected stochastic proximal gradient descent algorithm. We proved 
the convergence of this algorithm to the set of stationary points of the minimized func- 
tion. We used the same assumption on conditional intensities as in the case of complete 
data. In practice to discover the arrows we used the thresholded version of the obtained 
estimator. We showed on a small simulated example that the quality of the proposed 
method is similar to the case of complete data and increases with the number of observed 
points per interval. 

As for the future research we want to obtain similar theoretical results for Bayesian 
networks in the case of discrete data as we have obtained for the continuous data. In future 
we intend to perform more experiments and comparisons with existing approaches for all 
proposed methods. For some methods there are no open implementations or there are 
implementations in different programming languages, which makes it difficult to perform 
the comparison. The main goal was to show theoretical value of the proposed methods 
and show that the results of experiments are consistent with theory, which in our opinion 


was achieved. 
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